Re: Bad estimates (DEFAULT_UNK_SEL)

2017-11-22 Thread Justin Pryzby
On Wed, Nov 22, 2017 at 03:29:54PM +0100, Artur Zając wrote:
> CREATE TABLE xyz AS SELECT generate_series(1,1000,1) AS gs;
> 
> db=# explain analyze select * from xyz where gs&1=1;
>  Seq Scan on xyz  (cost=0.00..260815.38 rows=68920 width=4) (actual 
> time=0.044..2959.728 rows=500 loops=1)
...
> newrr=# explain analyze select * from xyz where gs&1=1 and gs&2=2 and gs&4=4;
>  Seq Scan on xyz  (cost=0.00..398655.62 rows=2 width=4) (actual 
> time=0.052..3329.422 rows=125 loops=1)

> I noticed that each additional clause reduces the number about 200 times and
> the define DEFAULT_NUM_DISTINCT is responsible for this behaviour.

I think it's actually:

src/include/utils/selfuncs.h-/* default selectivity estimate for boolean and null test nodes */
src/include/utils/selfuncs.h-#define DEFAULT_UNK_SEL 0.005

..which is 1/200.
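To illustrate the arithmetic (using a hypothetical 10M-row table rather than
the exact rowcounts above): each clause of unknown selectivity multiplies the
estimate by 0.005, so

  1 clause:  10,000,000 * 0.005   = 50,000 rows
  3 clauses: 10,000,000 * 0.005^3 ~= 1 row

..which is why each added clause divides the estimate by ~200 regardless of
what the data actually looks like.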

Note, you can do this, which helps a bit by collecting stats for the index
expr:

postgres=# CREATE INDEX ON xyz((gs&1));
postgres=# ANALYZE xyz;
postgres=# explain analyze SELECT * FROM xyz WHERE gs&1=1 AND gs&2=2 AND gs&4=4;
 Bitmap Heap Scan on xyz  (cost=91643.59..259941.99 rows=124 width=4) (actual 
time=472.376..2294.035 rows=125 loops=1)
   Recheck Cond: ((gs & 1) = 1)
   Filter: (((gs & 2) = 2) AND ((gs & 4) = 4))
   Rows Removed by Filter: 375
   Heap Blocks: exact=44248
   ->  Bitmap Index Scan on xyz_expr_idx  (cost=0.00..91643.55 rows=4962016 
width=0) (actual time=463.477..463.477 rows=500 loops=1)
 Index Cond: ((gs & 1) = 1)

Justin



Re: Bitmap scan is undercosted?

2017-12-01 Thread Justin Pryzby
On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
> We recently had an issue in production, where a bitmap scan was chosen
> instead of an index scan. Despite being 30x slower, the bitmap scan had
> about the same cost as the index scan.

Me too, see also:
https://www.postgresql.org/message-id/flat/CAH2-WzkRTggiy_LKQUu-oViyp6y_Hhz-a1yWacPy4tcYWV1HoA%40mail.gmail.com#cah2-wzkrtggiy_lkquu-oviyp6y_hhz-a1ywacpy4tcywv1...@mail.gmail.com

> drop table if exists aaa;
> create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
> generate_series(1, 1000) id;
> create index i1 on aaa  (num);
> create index i2 on aaa  (flag);
> analyze aaa;
> 
> select relname, reltuples::bigint, relpages::bigint,
> (reltuples/relpages)::bigint tpp from pg_class where relname
> in('aaa','i1','i2') order by relname;
> "aaa";985;44248;226
> "i1";985;27422;365
> "i2";985;27422;365
> 
> The query was:
> explain (analyze,verbose,costs,buffers)
> select count(*) from aaa where num = 1 and flag = true;

Note that id%100==1 implies flag='t', so the planner anticipates retrieving
fewer rows than it will ultimately read, probably by 2x.  It makes sense that
this causes the index scan to be more expensive than expected, but that's only
somewhat important, since there are no joins involved.

The reason why it's more than a bit slower is due to the "density" [0] of the
heap pages read.  num=1 is more selective than flag=true, so it scans i1,
reading 1% of the whole table.  But it's not reading the first 1% or some
other contiguous 1% of the table; it reads tuples evenly distributed across the
entire table (226*0.01 = ~2 rows of each page).  Since the index was created
after the INSERT, the repeated keys (logical value: id%100) are read in
physical order on the heap, so this is basically doing a seq scan, but with the
additional overhead of reading the index, and maybe doing an fseek() before
each/some read()s, too.  You could confirm that by connecting strace to the
backend before starting the query.

Since you did that using % and with indices added after the INSERT, you can't
improve it by reindexing (as I was able to for my case).  That's an elegant
test case, so thanks.

I think shared_buffers=512MB is just small enough for this test to be painful
for 1e7 rows.  I see the table+index is 559MB.

I don't know if that's really similar to your production use case, but I would
recommend trying BRIN indices, which always require a bitmap scan.  Note that
some things (like max()) that can use a btree index cannot use brin.  PG10.1
has WITH autosummarize, which was important for our use, since we rarely do
UPDATEs or DELETEs so tables are rarely vacuumed (only analyzed).
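For reference, a minimal sketch of the BRIN syntax (table and column names here
are hypothetical, not from your test case):

postgres=# CREATE INDEX ON big_log USING brin (ts) WITH (autosummarize = on);
-- or, without autosummarize, summarize new ranges manually/periodically:
postgres=# SELECT brin_summarize_new_values('big_log_ts_idx');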

Justin

[0] I'm borrowing Jeff's language from here:
https://www.postgresql.org/message-id/CAMkU%3D1xwGn%2BO0jhKsvrUrbW9MQp1YX0iB4Y-6h1mEz0ffBxK-Q%40mail.gmail.com
"density" wasn't our problem, but it's a perfect description of this issue.



Re: Bitmap scan is undercosted?

2017-12-01 Thread Justin Pryzby
I tried to reproduce this issue and couldn't, under PG95 and 10.1:

On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
> > We recently had an issue in production, where a bitmap scan was chosen
> > instead of an index scan. Despite being 30x slower, the bitmap scan had
> > about the same cost as the index scan.
> 
> > drop table if exists aaa;
> > create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
> > generate_series(1, 1000) id;
> > create index i1 on aaa  (num);
> > create index i2 on aaa  (flag);
> > analyze aaa;

What is:
effective_io_concurrency
max_parallel_workers_per_gather (I gather you don't have this)

Note:
postgres=# SELECT correlation FROM pg_stats WHERE tablename='aaa' AND 
attname='num';
correlation | 0.00710112

..so this is different from the issue corrected by the patch I created while
testing.

> Note that id%100==1 implies flag='t', so the planner anticipates retrieving
> fewer rows than it will ultimately read, probably by 2x.  It makes sense that
> causes the index scan to be more expensive than expected, but that's only
> somewhat important, since there's no joins involved.

I changed the query from COUNT(*) to * for an easier-to-read explain:

CREATE TABLE aaa AS SELECT (id%100)::int num, (id%10=1)::bool flag FROM 
generate_series(1, 1000) id;
CREATE INDEX i1 ON aaa(num);
CREATE INDEX i2 ON aaa (flag);
ANALYZE VERBOSE aaa;
EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND 
flag=true;
 Bitmap Heap Scan on public.aaa  (cost=20652.98..45751.75 rows=10754 width=5) 
(actual time=85.314..185.107 rows=10 loops=1)
   ->  BitmapAnd  (cost=20652.98..20652.98 rows=10754 width=0) (actual 
time=163.220..163.220 rows=0 loops=1)
 ->  Bitmap Index Scan on i1  (cost=0.00..1965.93 rows=106333 width=0) 
(actual time=26.943..26.943 rows=10 loops=1)
 ->  Bitmap Index Scan on i2  (cost=0.00..18681.42 rows=1011332 
width=0) (actual time=133.804..133.804 rows=100 loops=1)

..which is what's wanted with no planner hints (PG10.1 here).

Same on PG95:
postgres=# EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE 
num=1 AND flag=true;
 Bitmap Heap Scan on public.aaa  (cost=19755.64..43640.32 rows=9979 width=5) 
(actual time=230.017..336.583 rows=10 loops=1)
   ->  BitmapAnd  (cost=19755.64..19755.64 rows=9979 width=0) (actual 
time=205.242..205.242 rows=0 loops=1)
 ->  Bitmap Index Scan on i1  (cost=0.00..1911.44 rows=103334 width=0) 
(actual time=24.911..24.911 rows=10 loops=1)
 ->  Bitmap Index Scan on i2  (cost=0.00..17838.96 rows=965670 width=0) 
(actual time=154.237..154.237 rows=100 loops=1)

The rowcount is off, but not a critical issue without a join.

Justin



Re: Bitmap scan is undercosted?

2017-12-01 Thread Justin Pryzby
On Fri, Dec 01, 2017 at 05:11:04PM -0600, Justin Pryzby wrote:
> I tried to reproduce this issue and couldn't, under PG95 and 10.1:

I'm embarrassed to say that I misread your message, despite your amply clear
subject.  You're getting a bitmap scan but you'd prefer to get an index scan.
I anticipated the opposite problem (which is what I've had issues with myself).

> On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
> > On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
> > > We recently had an issue in production, where a bitmap scan was chosen
> > > instead of an index scan. Despite being 30x slower, the bitmap scan had
> > > about the same cost as the index scan.
> 
> Note:
> postgres=# SELECT correlation FROM pg_stats WHERE tablename='aaa' AND 
> attname='num';
> correlation | 0.00710112
> 
> ..so this is different from the issue corrected by the patch I created while
> testing.

Actually, the fact that the table is "not correlated" on the "num" column is
maybe the primary reason why PG avoids using an index scan.  It (more or less
correctly) deduces that it's going to have to "read" a large fraction of the
pages (even if only to process a small fraction of the rows), which is costly,
except that here it's all cached.  In your case, that overly penalizes the
index scan.

This is cost_index() and cost_bitmap_heap_scan() in costsize.c.  Since the
index is uncorrelated, it's returning something close to max_IO_cost.  It looks
like effective_cache_size only affects index_pages_fetched().

I'm going to try to dig some more into it.  Maybe there's evidence to
re-evaluate one of these:

cost_index()
| run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
or
cost_bitmap_heap_scan()
| cost_per_page = spc_random_page_cost - 
|   (spc_random_page_cost - spc_seq_page_cost)
|   * sqrt(pages_fetched / T);
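For example, with the default spc_random_page_cost=4.0 and spc_seq_page_cost=1.0,
a bitmap heap scan expected to fetch a quarter of the table's pages
(pages_fetched/T = 0.25) gets charged:

  cost_per_page = 4.0 - (4.0 - 1.0) * sqrt(0.25) = 2.5

i.e. the per-page charge slides from random toward sequential as the fetched
fraction grows.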

Justin



Re: Bitmap scan is undercosted?

2017-12-01 Thread Justin Pryzby
On Sat, Dec 02, 2017 at 01:54:09AM +0200, Vitaliy Garnashevich wrote:
> On 02/12/2017 01:11, Justin Pryzby wrote:
> >..which is what's wanted with no planner hints (PG10.1 here).
> Yes, that's what you get without planner hints, but it's strange to get this
> plan, when there is another one, which runs 2-3 times faster, but happens to
> be estimated to be twice more costly than the one with bitmap scans:
> 
> # set enable_bitmapscan = off; set enable_indexscan = on;  set enable_seqscan 
> = off;
> # explain analyze select * from aaa where num = 1 and flag = true;
> Index Scan using i1 on aaa  (cost=0.44..66369.81 rows=10428 width=5) (actual 
> time=0.020..57.765 rows=10 loops=1)
> 
> vs.
> 
> # set enable_bitmapscan = on;  set enable_indexscan = off; set enable_seqscan 
> = off;
> # explain analyze select * from aaa where num = 1 and flag = true;
> Bitmap Heap Scan on aaa  (cost=13099.33..25081.40 rows=10428 width=5) (actual 
> time=122.137..182.811 rows=10 loops=1)

I was able to get an index plan with:

SET random_page_cost=1; SET cpu_index_tuple_cost=.04; -- default: 0.005; see 
selfuncs.c
postgres=# EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE 
num=1 AND flag=true; 
 Index Scan using i1 on public.aaa  (cost=0.43..50120.71 rows=10754 width=5) 
(actual time=0.040..149.580 rows=10 loops=1)

Or with:
SET random_page_cost=1; SET cpu_operator_cost=0.03; -- default: 0.0025 see 
cost_bitmap_tree_node()
EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM aaa WHERE num=1 AND flag= 
true;  
 Index Scan using i1 on public.aaa  (cost=5.22..49328.00 rows=10754 width=5) 
(actual time=0.051..109.082 rows=10 loops=1)

Or a combination trying to minimize the cost of the index scan:
postgres=# SET random_page_cost=1; SET cpu_index_tuple_cost=.0017; SET 
cpu_operator_cost=0.03; EXPLAIN (analyze,verbose,costs,buffers) SELECT * FROM 
aaa WHERE num=1 AND flag= true;  
 Index Scan using i1 on public.aaa  (cost=5.22..48977.10 rows=10754 width=5) 
(actual time=0.032..86.883 rows=10 loops=1)

Not sure if that's reasonable, but maybe it helps to understand.

Justin



Re: Bitmap scan is undercosted? - boolean correlation

2017-12-02 Thread Justin Pryzby
On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote:
> I think the non-extended stats code also has trouble with booleans.
> pg_stats gives me a correlation  of 0.8 or higher for the flag column.

It's not due to the boolean though; you see the same thing if you do:
CREATE INDEX aaa_f ON aaa((flag::text));
ANALYZE aaa;
correlation | 0.81193

or:
ALTER TABLE aaa ADD flag2 int; UPDATE aaa SET flag2= flag::int
correlation | 0.81193

I think it's caused by having so few (2) values to correlate.

most_common_vals   | {f,t}
most_common_freqs  | {0.9014,0.0986}
correlation| 0.822792

It thinks there's somewhat-high correlation since it gets a list of x and y
values (integer positions by logical and physical sort order) and 90% of the x
list (logical value) are the same value ('t').  Since the CTIDs are in order on
the new index, those 90% of the values are 100% correlated.

It improves (by which I mean here that it spits out a lower number) if it's not
a 90/10 split:

CREATE TABLE aaa5 AS SELECT (id%100)::int num, (id%10>5)::bool flag FROM 
generate_series(1, 1000) id;
CREATE INDEX ON aaa5 (flag);

tablename   | aaa5
attname | flag
correlation | 0.522184

Justin



Re: Bitmap scan is undercosted? - overestimated correlation and cost_index

2017-12-06 Thread Justin Pryzby
On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:
> Jeff Janes  writes:
> > On Dec 3, 2017 15:31, "Tom Lane"  wrote:
> >> Jeff Janes  writes:
> >>> But I do see that ties within the logical order of the column values are
> >>> broken to agree with the physical order.  That is wrong, right?  Is there
> >>> any argument that this is desirable?
> 
> >> Uh ... what do you propose doing instead?  We'd have to do something with
> >> ties, and it's not so obvious this way is wrong.
> 
> > Let them be tied.
...
> I thought some more about this.  What we really want the correlation stat
> to do is help us estimate how randomly an index-ordered scan will access
> the heap.  If the values we've sampled are all unequal then there's no
> particular issue.  However, if we have some group of equal values, we
> do not really know what order an indexscan will visit them in.  The
> existing correlation calculation is making the *most optimistic possible*
> assumption, that such a group will be visited exactly in heap order ---
> and that assumption isn't too defensible.

I'm interested in discussion regarding bitmap cost, since it would have helped
our case discussed here ~18 months ago:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.ga11...@telsasoft.com

...but remember: in Vitaliy's case (as opposed to mine), the index scan is
*faster* but is being estimated at higher cost than the bitmap scan (I have to
keep reminding myself).  So the rest of this discussion is about the
overly-optimistic cost estimate of index scans, which moves in the opposite
direction for this reported problem.  For the test cases I looked at, index
scans were used when RPC=1 and redundant conditions were avoided, so I'm not
sure if there's any remaining issue (but I haven't looked at the latest cases
Vitaliy sent).

> In any case, given that we do this calculation without regard
> to any specific index,

One solution is to compute stats (at least correlation) for all indices, not
just expr inds.  I did that earlier this year while throwing around/out ideas.
https://www.postgresql.org/message-id/20170707234119.GN17566%40telsasoft.com

> We do have an idea, from the data we have, whether the duplicates are close
> together in the heap or spread all over.

I think you just mean pg_stats.correlation for all values, not just duplicates
(with the understanding that duplicates might be a large fraction of the
tuples, and high weight in correlation).

Another issue I noted in an earlier thread is that, as table sizes increase, the
existing correlation computation approaches 1 for correlated insertions (like
"append-only" timestamps clustered around now()).  That's because ANALYZE samples
only a fraction of the table, thereby representing only large-scale correlation
and, to an increasing degree, failing to represent small-scale variations
between adjacent index TIDs, which have real cost (and for which the mitigation
by cache effects probably decreases WRT table size, too).  I think any solution
needs to handle this somehow.

Generated data demonstrating this (I reused this query so it's more complicated
than it needs to be):

[pryzbyj@database ~]$ time for sz in {,9{,9{,9{,9 ; do psql postgres 
-tc "DROP TABLE IF EXISTS t; CREATE TABLE t(i float, j int); CREATE INDEX ON 
t(i);INSERT INTO t SELECT i/9.0+pow(2,(-random())) FROM 
generate_series(1,$sz) i ORDER BY i; ANALYZE t; SELECT $sz, correlation, 
most_common_freqs[1] FROM pg_stats WHERE attname='i' AND tablename='t'"; done

  |0.187146 |  
9 |0.900629 |  
   99 |0.998772 |  
  999 |0.87 |  

Trying to keep it all in my own head: for a sufficiently large number of pages,
a bitmap scan should be preferred to an index scan, due to its reduced
random-page cost outweighing its overhead in CPU cost.  Probably by penalizing
index scans, not discounting bitmap scans.  Conceivably a correlation adjustment can be
conditionalized or weighted based on index_pages_fetched() ...
x = ln (x/99);
if (x>1) correlation/=x;

I think one could look at the fraction of duplicated index keys expected to be
returned: if we expect to return 1000 tuples, with 200 duplicates from MCV,
cost_index would multiply correlation by (1 - 200/1000), meaning to use
something closer to max_IO_cost rather than min_IO_cost.  I imagine it'd be
possible to limit to only those MCVs which pass quals - if none pass, then
there may be few tuples returned, so apply no correction to (additionally)
penalize index scan.
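To put numbers on that idea (invented for illustration only): with
correlation=0.95, cost_index() currently uses csquared = 0.95^2 ~= 0.90, i.e.
run_cost lands close to min_IO_cost.  Scaling by (1 - 200/1000) gives
0.95 * 0.8 = 0.76 and csquared ~= 0.58, which shifts run_cost noticeably back
toward max_IO_cost.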

In my tests, at one point I implemented idx_corr_fudge(), returning a value
like "fragmentation" from pgstatindex (which I'm sure is where I got the phrase
when reporting the problem).  That only uses the leaf nodes' "next" pointer,
and not the individual tuples, which probably works if there's a sufficient
number of repeated keys.

I think that's all for now..

Justin



Re: CPU 100% usage caused by unknown postgres process..

2017-12-13 Thread Justin Pryzby
On Wed, Dec 13, 2017 at 11:36:11AM +0100, Laurenz Albe wrote:
> Dinesh Chandra 12108 wrote:
> > My CPU utilization is going to 100% in PostgreSQL because of one unknown 
> > process /x330341 is running from postgres user.
> >  
> > PID   USER  PR  NIVIRTRESSHR S %CPU %MEM   TIME+   COMMAND
> > 19885 postgres 20   0  192684   3916   1420 S 99.3  0.1   5689:04  
> > x330341 
> >  
> > The same file is automatically created in Postgres Cluster also. I am using 
> > Postgresql-9.3.
> >  
> > Kindly suggest how can I resolve this issue.
> 
> I don't know, but the same problem has been reported on Stackoverflow:
> https://stackoverflow.com/q/46617329/6464308
> 
> If your queries look similar, then you might indeed be the victim of an 
> attack.
> 
> Figure out where the function and the executable come from.
> 
> In case of doubt, disconnect the server from the network.

Looks suspicious; I would look at (and save) things like these:

ls -l /proc/19885/exe
ls -l /proc/19885/fd
ls -l /proc/19885/cwd

sudo lsof -n -p 19885
sudo netstat -anpe |grep 19885

Stacktrace with gcore/gdb is a good idea.
Save a copy of your log/postgres logfiles and try to figure out where it came
from.  Since an attacker seems to control the postgres process, your data may
have been compromised (leaked or tampered with).

Justin



Re: Bitmap scan is undercosted? - overestimated correlation and cost_index

2017-12-15 Thread Justin Pryzby
On Tue, Dec 12, 2017 at 01:29:48AM -0800, Jeff Janes wrote:
> On Wed, Dec 6, 2017 at 1:46 PM, Justin Pryzby  wrote:
> > On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:

> > > In any case, given that we do this calculation without regard
> > > to any specific index,
> >
> > One solution is to compute stats (at least correlation) for all indices,
> > not
> > just expr inds.  I did that earlier this year while throwing around/out
> > ideas.
> > https://www.postgresql.org/message-id/20170707234119.
> > GN17566%40telsasoft.com
> 
> When is the correlation of a column which is not the leading column of a
> btree index or in a brin index ever used?  If we did compute index-specific
> correlations, we could maybe just drop pure-column correlations.

Yes I think so - correlation is collected for every column, but only used for
indices.

I also have a comment to myself in that patch to force attstattarget=0 for
non-expr indices, to avoid keeping MCV/histogram which duplicates that of their
column.

> Trying to keep it all in my own head: For sufficiently large number of
> > pages,
> > bitmap scan should be preferred to idx scan due to reduced random-page-cost
> > outweighing its overhead in CPU cost.
> 
> 
> But CPU cost is probably not why it is losing anyway.
> 
> Index scans get a double bonus from high correlation.  It assumes that only
> a small fraction of the table will be visited.  And then it assumes that
> the visits it does make will be largely sequential.  I think that you are
> saying that for a large enough table, that last assumption is wrong, that
> the residual amount of non-correlation is enough to make the table reads
> more random than sequential.  Maybe.  Do you have a test case that
> demonstrates this?  If so, how big do we need to go, and can you see the
> problem on SSD as well as HDD?

Right: The "residual"/fine-scale variations (those which are not adequately
represented by correlation metric) are/may be non-sequential, so don't get good
readahead.  

The original issue was with a 75GB table (an inheritance child) and an
analytic query previously taking ~30min at that point taking 4-5 hours due to
random seeks (from duplicate values in a timestamp column with 1-second
resolution).  There would've been very little if any of the previous day's
table cached: the child table being queried (by way of its parent) was roughly
the same size as the server's RAM, and would've been loaded over the course of
the preceding 6-30 hours, and not frequently accessed.  It may be that there's a
sharp change once cache no longer effectively mitigates the random heap reads.

SSD: good question.

Here's a rackspace VM with PG9.6.6, 2GB shared_buffers, 8GB RAM (~4GB of which
is being used as OS page cache), and 32GB SSD (with random_page_cost=1).  The
server is in use by our application.

I believe you could scale up the size of the table to see this behavior with
any cache size.  0.0001 controls the "jitter", with smaller values being more
jittery..

postgres=# CREATE TABLE t(i int,j int) TABLESPACE tmp; CREATE INDEX ON t(i); 
INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM 
generate_series(1,) a; VACUUM ANALYZE t;
 public | t| table | pryzbyj | 3458 MB |
relpages | 442478

For comparison purposes/baseline; here's a scan on a SEPARATE index freshly
built AFTER insertions:

postgres=# explain(analyze,buffers) SELECT COUNT(j) FROM t WHERE i BETWEEN 0 
AND 4000;
First invocation:
#1  ->  Index Scan using t_i_idx1 on t  (cost=0.57..1413352.60 rows=39933001 
width=4) (actual time=25.660..52575.127 rows=39996029 loops=1)
 Buffers: shared hit=1578644 read=286489 written=1084
Subsequent invocations with (extra) effect from OS cache:
#2  ->  Index Scan using t_i_idx1 on t  (cost=0.57..1413352.60 rows=39933001 
width=4) (actual time=61.054..37646.556 rows=39996029 loops=1)
 Buffers: shared hit=1578644 read=286489 written=2223
#3  ->  Index Scan using t_i_idx1 on t  (cost=0.57..1413352.60 rows=39933001 
width=4) (actual time=9.344..31265.398 rows=39996029 loops=1)
 Buffers: shared hit=1578644 read=286489 written=1192

Dropping that index, and scanning a different range on the non-fresh index:

postgres=# explain(analyze,buffers) SELECT COUNT(j) FROM t WHERE i BETWEEN 4000 
AND 8000;
#1  ->  Index Scan using t_i_idx on t  (cost=0.57..1546440.47 rows=40298277 
width=4) (actual time=95.815..139152.147 rows=40009853 loops=1)
 Buffers: shared hit=1948069 read=316536 written=3411
Rerunning with cache effects:
#2  ->  Index Scan using t_i_idx on t  (cost=0.57..1546440.47 rows=40298277 
width=4) (actual time=203.590..87547.287 rows=40009853 loops=1)
 Buffers: shared hit=1948069 read=316536 written=5712
#3  ->  Index Scan using t_i_idx on t  (c

Re: Bitmap scan is undercosted? - overestimated correlation and cost_index

2017-12-16 Thread Justin Pryzby
On Fri, Dec 15, 2017 at 02:54:06PM -0600, Justin Pryzby wrote:
> SSD: good question.
> 
> Here's an rackspace VM with PG9.6.6, 2GB shared_buffers, 8GB RAM (~4GB of 
> which
> is being used as OS page cache), and 32GB SSD (with random_page_cost=1).  The
> server is in use by our application.
> 
> I believe you could scale up the size of the table to see this behavior with
> any cache size.  0.0001 controls the "jitter", with smaller values being more
> jittery..
> 
> postgres=# CREATE TABLE t(i int,j int) TABLESPACE tmp; CREATE INDEX ON t(i); 
> INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM 
> generate_series(1,) a; VACUUM ANALYZE t;
>  public | t| table | pryzbyj | 3458 MB |
> relpages | 442478

I realized I've made a mistake here; the table is on SSD but not its index...
So all this cost is apparently coming from the index and not the heap.

   ->  Bitmap Heap Scan on t  (cost=855041.91..1901994.06 rows=40298277 
width=4) (actual time=14202.624..27754.982 rows=40009853 loops=1)
 ->  Bitmap Index Scan on t_i_idx1  (cost=0.00..844967.34 rows=40298277 
width=0) (actual time=14145.877..14145.877 rows=40009853 loops=1)

Let me get back to you about that.

Justin



Re: Bitmap scan is undercosted? - overestimated correlation and cost_index

2017-12-16 Thread Justin Pryzby
On Fri, Dec 15, 2017 at 02:54:06PM -0600, Justin Pryzby wrote:
> SSD: good question.
> 
> Here's an rackspace VM with PG9.6.6, 2GB shared_buffers, 8GB RAM (~4GB of 
> which
> is being used as OS page cache), and 32GB SSD (with random_page_cost=1).  The
> server is in use by our application.
> 
> I believe you could scale up the size of the table to see this behavior with
> any cache size.  0.0001 controls the "jitter", with smaller values being more
> jittery..

On Sat, Dec 16, 2017 at 01:18:38PM -0600, Justin Pryzby wrote:
> I realized I've made a mistake here; the table is on SSD but not its index...
> So all this cost is apparently coming from the index and not the heap.
> 
>->  Bitmap Heap Scan on t  (cost=855041.91..1901994.06 rows=40298277 
> width=4) (actual time=14202.624..27754.982 rows=40009853 loops=1)
>  ->  Bitmap Index Scan on t_i_idx1  (cost=0.00..844967.34 
> rows=40298277 width=0) (actual time=14145.877..14145.877 rows=40009853 
> loops=1)

I'm rerunning with this:

postgres=# CREATE TABLE t(i int,j int) TABLESPACE tmp; CREATE INDEX ON t(i) 
TABLESPACE tmp; INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM 
generate_series(1,) a; VACUUM ANALYZE t; CREATE INDEX ON t(i) 
TABLESPACE tmp;

That doesn't seem to invalidate my conclusions regarding the test data.

The non-fresh index:
#1  ->  Index Scan using t_i_idx on t  (cost=0.57..1103588.59 rows=39536704 
width=4) (actual time=2.295..60094.704 rows=40009646 loops=1)
Rerun:
#2  ->  Index Scan using t_i_idx on t  (cost=0.57..1103588.59 rows=39536704 
width=4) (actual time=1.671..54209.037 rows=40009646 loops=1)
#3  ->  Index Scan using t_i_idx on t  (cost=0.57..1103588.59 rows=39536704 
width=4) (actual time=1.743..46436.538 rows=40009646 loops=1)

Scan fresh index:
   ->  Index Scan using t_i_idx1 on t  (cost=0.57..1074105.46 rows=39536704 
width=4) (actual time=1.715..16119.720 rows=40009646 loops=1)

bitmap scan on non-fresh idx:
   ->  Bitmap Heap Scan on t  (cost=543141.78..1578670.34 rows=39536704 
width=4) (actual time=4397.767..9137.541 rows=40009646 loops=1)
 Buffers: shared hit=91235 read=225314
 ->  Bitmap Index Scan on t_i_idx  (cost=0.00..533257.61 rows=39536704 
width=0) (actual time=4346.556..4346.556 rows=40009646 loops=1)
   Buffers: shared read=139118

seq scan:
   ->  Seq Scan on t  (cost=0.00..1942478.00 rows=39536704 width=4) (actual 
time=6093.269..17880.164 rows=40009646 loops=1)

I also tried an idx only scan (note COUNT i vs j / "eye" vs "jay"), which I
think should be like an index scan without heap costs:

postgres=# SET max_parallel_workers_per_gather=0;SET enable_bitmapscan=off;SET 
enable_indexscan=on; begin; DROP INDEX t_i_idx1; explain(analyze,buffers) 
SELECT COUNT(i) FROM t WHERE i BETWEEN 4000 AND 8000; rollback;
   ->  Index Only Scan using t_i_idx on t  (cost=0.57..928624.65 rows=39536704 
width=4) (actual time=0.515..12646.676 rows=40009646 loops=1)
 Buffers: shared hit=276 read=139118

However, in this test, random reads on the INDEX are still causing a large
fraction of the query time.  When cached by the OS, this is much faster.
Compare:

#1   ->  Bitmap Heap Scan on t  (cost=543141.78..1578670.34 rows=39536704 
width=4) (actual time=25498.978..41418.870 rows=40009646 loops=1)
 Buffers: shared read=316549 written=497
 ->  Bitmap Index Scan on t_i_idx  (cost=0.00..533257.61 rows=39536704 
width=0) (actual time=25435.865..25435.865 rows=40009646 loops=1)
   Buffers: shared read=139118 written=2

#2   ->  Bitmap Heap Scan on t  (cost=543141.78..1578670.34 rows=39536704 
width=4) (actual time=5863.003..17531.860 rows=40009646 loops=1)
 Buffers: shared read=316549 written=31
 ->  Bitmap Index Scan on t_i_idx  (cost=0.00..533257.61 rows=39536704 
width=0) (actual time=5799.400..5799.400 rows=40009646 loops=1)
   Buffers: shared read=139118 written=31

Note that for the test data, the index is a large fraction of the table data
(since the only non-indexed column has nullfrac=1):
 public | t| table | pryzbyj | 3458 MB | 
 public | t_i_idx  | index | pryzbyj | t | 2725 MB | 
 public | t_i_idx1 | index | pryzbyj | t | 2142 MB | 
(that could be 10% smaller with fillfactor=100)

I think the test case is reasonably reproducing the original issue.  Note that
the 2nd invocation of the bitmap scan scanned the index in 5.8sec and the heap
in 11sec, but the 2nd invocation of the index scan took 54sec, of which I
gather ~6sec was from the index.  So there's still 48sec spent accessing the
heap randomly, rather than 11sec sequentially.

I'm also playing with the tables which were the source of the original problem,
for which index reads in bitmap scan do not appear to be a large fraction of
the query time, probably because the indexes are 1-2% of the table size rather
than 60-70%.  I'll mail about that separately.

Justin



Re: Autoanalyze CPU usage

2017-12-19 Thread Justin Pryzby
On Tue, Dec 19, 2017 at 08:47:52AM -0800, Habib Nahas wrote:
> We operate an RDS postgres 9.5 instance and have periodic CPU spikes to
> 100%. These spikes appear to be due to autoanalyze kicking on our larger
> tables.

Not sure if it'll help you, but for our large, insert-only tables partitioned
by time, I made several changes from default:
 
 - near the end of the month, report queries for the previous day's data had poor
   statistics, because the autoanalyze scale factor defaults to 0.1, so a table
   analyzed on the 24th of the month won't be analyzed again until the 26th, so
   the histogram shows that there are zero rows for the previous day, causing a
   nested loop over thousands of rows.
 - for separate reasons, I increased statistics target on our key columns (up
   to 3000 for one column).
 - large stats target on large tables caused (auto)analyze to use a large amount
   of RAM.  Therefore I changed our largest tables from monthly partition
   granularity (YYYYMM) to daily (YYYYMMDD).  That creates what's
   traditionally considered to be an excessive number of partitions (and very
   large pg_attribute/attrdef and pg_statistic tables), but avoids the huge RAM
   issue, and works for our purposes (and I hope the traditional advice for
   number of child tables is relaxed in upcoming versions, too).

One possibility is a cronjob to set the default "scale factor" to a
modest/default value (0.1) during business hours and an aggressive value
(0.005) off-hours.  You could do similar with autovacuum_max_workers ... but
beware if they're causing high RAM use.  I believe autovacuum workers try to
"play nice" and the cost is shared between all workers.  But I suspect that's
not true for CPU cost or RAM use, so there's nothing stopping you from having 9
workers each using 2+GB RAM and 100% CPU doing MCV/histogram computation.
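
A sketch of that cronjob idea, for a self-managed server (on RDS you'd change
the parameter group via the AWS API instead; the values are only examples):

-- run at the start of off-hours:
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.005;
SELECT pg_reload_conf();
-- run at the start of business hours:
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.1;
SELECT pg_reload_conf();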

Maybe that's of some use.

Justin



Re: Autoanalyze CPU usage

2017-12-19 Thread Justin Pryzby
On Tue, Dec 19, 2017 at 02:37:18PM -0800, Habib Nahas wrote:
> As it happens our larger tables operate as a business log and are also
> insert only.
> 
> - There is no partitioning at this time since we expect to have an
> automated process to delete rows older than a certain date.

This is a primary use case for partitioning: bulk DROP rather than DELETE.
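
For illustration (hypothetical table names), retention with monthly children is
just:

DROP TABLE business_log_201701;

..versus the unpartitioned equivalent, which leaves dead tuples behind for
vacuum to clean up:

DELETE FROM business_log WHERE created_at < '2017-02-01';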

> - Analyzing during off-hours sounds like a good idea; if there is no other
> way to determine effect on db we may end up doing that.

You can also implement a manual analyze job and hope to avoid autoanalyze.

> - We have an open schema and heavily depend on jsonb, so I'm not sure if
> increasing the statistics target will be helpful.

If the increased stats target isn't useful for that, I would recommend
decreasing it.

-- 
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581



Re: Table performance with millions of rows (partitioning)

2017-12-27 Thread Justin Pryzby
On Wed, Dec 27, 2017 at 07:54:23PM -0500, Robert Blayzor wrote:
> Question on large tables…
> 
> When should one consider table partitioning vs. just stuffing 10 million rows 
> into one table?

IMO, whenever constraint exclusion, DROP vs DELETE, or seq scan on individual
children justify the minor administrative overhead of partitioning.  Note that
partitioning may be implemented as direct insertion into child tables, or may
involve triggers or rules.
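
As a minimal sketch of the trigger approach (table and column names are
hypothetical; direct insertion into the children avoids this entirely):

CREATE OR REPLACE FUNCTION cdrs_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.ts >= '2018-01-01' AND NEW.ts < '2018-02-01' THEN
        INSERT INTO cdrs_201801 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for ts %', NEW.ts;
    END IF;
    RETURN NULL;  -- suppress the insert into the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cdrs_insert BEFORE INSERT ON cdrs
    FOR EACH ROW EXECUTE PROCEDURE cdrs_insert_trigger();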

> I currently have CDR’s that are injected into a table at the rate of over
> 100,000 a day, which is large.
> 
> At some point I’ll want to prune these records out, so being able to just
> drop or truncate the table in one shot makes child table partitions
> attractive.

That's one of the major use cases for partitioning (DROP rather than DELETE and
thus avoiding any following vacuum+analyze).
https://www.postgresql.org/docs/10/static/ddl-partitioning.html#DDL-PARTITIONING-OVERVIEW

Justin



Re: partitioning an existing table

2017-12-29 Thread Justin Pryzby
On Fri, Dec 29, 2017 at 11:37:56PM -0500, Robert Blayzor wrote:
> The docs claim that the master table “should” be empty. Is it possible to 
> just create child tables off an existing master table with data, then just 
> insert data into the new child tables.
> 
> THe plan would be to keep data in the master table and purge it over time 
> until it’s eventually empty, then drop the indexes as well.
> 
> Fully understanding that data needs to be placed in the right child tables. 
> Data outside of those child ranges would remain as “old data” in the master 
> table.
> 
> Just trying to grab if that’s an acceptable migration of live data from a 
> single large table and move into partitioning. Think of it as a very large 
> table of cyclic data that ages out. New data in child tables while removing 
> data from the master table over time.

For PG10 "partitions" (as in relkind='p') the parent is defined as empty
(actually has no underlying storage).

For inheritance (available in and before PG10), the parent may be nonempty,
which works fine, although someone else might find it unintuitive.  (Does the
doc actually say "should" somewhere ?)

You almost certainly want child tables to have constraints, to allow
constraint_exclusion (which is the only reason one child table is more "right"
than any other, besides the associated pruning/retention schedule).
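
For example (hypothetical names; adjust the column and ranges to your data):

CREATE TABLE cdrs_201801 (
    CHECK (ts >= '2018-01-01' AND ts < '2018-02-01')
) INHERITS (cdrs);
CREATE INDEX ON cdrs_201801 (ts);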

Since you'll be running DELETE rather than DROP on the parent, you might
consider DELETE ONLY..  but it won't matter if your children's constraints are
usable with DELETE's WHERE condition.

Also, note that autoanalyze doesn't know to analyze the PARENT's statistics
when its children are INSERTED/DROPPED/etc.  So I'd suggest considering an
ANALYZE of each parent following DROP of its children (or maybe on some more frequent
schedule to handle inserted rows, too).  Perhaps that should be included as a
CAVEAT?
https://www.postgresql.org/docs/10/static/ddl-inherit.html#DDL-INHERIT-CAVEATS

Just curious: are your constraints/indices on starting time or ending time?

BTW depending on your requirements, it may be possible to make pg_dump much
more efficient.  For our data, it's reasonable to assume that a table is
"final" if its constraints exclude data older than a few days ago, and it can
be permanently dumped and excluded from future, daily backups, which makes the
backups smaller and faster, and probably causes less cache churn, etc.  But I
imagine you might have different requirements, so that may be infeasible, or
you'd maybe have to track insertions, either via pg_stat_user_tables, or at the
application layer, and redump the relevant table.

Justin



analyze stats: child vs parent

2017-12-29 Thread Justin Pryzby
This topic is confusing to lots of people, usually including myself, so I'm
hoping to clarify it at least to myself, and maybe provide a good reference or
doc update for others in the future.

autovacuum/analyze automatically scans tables being inserted/updated/deleted
and updates their statistics in pg_class and pg_statistic.  Since PG 9.0 [0,1],
ANALYZE (can) include stats of child tables along with stats of the (ONLY)
parent table.  But, autoanalyze still doesn't know to analyze (typical) empty
parent tables, which need to be manually ANALYZEd to include stats for their
children.
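
A quick way to see both sets of stats (the table name here is hypothetical):

SELECT inherited, attname, null_frac, n_distinct
FROM pg_stats WHERE tablename = 'parent_tbl';

..rows with inherited='t' are the parent-plus-children stats, and only show up
once the parent itself has been ANALYZEd.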

...which leaves one wondering: "which stats are being used, and why are we
keeping two and apparently sometimes not looking at both/either?"

I think the explanation is this:
 - Parent table stats without children (pg_statistic.stainherit='f') [2] are
used if you query with SELECT ... FROM ONLY parent.  Simple enough.

 - Postgres uses rowcount estimate as the primary component of query planning.
When planning a query involving a parent table, its rowcount estimate is
obtained as the sum of the rowcounts for its child nodes (appendrels) - if a
table is excluded by query exclusion, it doesn't even show up in the plan, and
if only a fraction of its rows are returned due to a restrictive clause, that's
reflected in its rowcount estimate and in the estimate of the parent.  So child
tables need to be analyzed for their rowcount (and also for their column stats
which affect rowcount).

 - But, column stats (ndistinct, most-common values, and histogram) are
relatively big, and there's nothing implemented (yet) to intelligently combine
them across child tables in a query.  So postgres, having obtained a rowcount
estimate for parent tables involved in a query, and having determined how (or one
way) to join the tables, needs to determine how many rows are expected to
be output by a join, which uses the parent table's column stats
(ndistinct, MCV list, histogram).

Is that mostly right ?

Toward the future: maybe, with declarative partitioning, combining
selectivities as in [3] is possible now without objectionable planning overhead
(?)

Justin

References
[0] https://www.postgresql.org/docs/9.0/static/release-9-0.html#AEN102560
[1] 
https://www.postgresql.org/message-id/flat/2674.1262040064%40sss.pgh.pa.us#2674.1262040...@sss.pgh.pa.us
[2] https://www.postgresql.org/docs/current/static/catalog-pg-statistic.html
[3] 
https://www.postgresql.org/message-id/flat/7363.1426537103%40sss.pgh.pa.us#7363.1426537...@sss.pgh.pa.us
more:
https://www.postgresql.org/message-id/7806.1310233...@sss.pgh.pa.us
https://www.postgresql.org/message-id/4b3875c602250002d...@gw.wicourts.gov
https://www.postgresql.org/message-id/29559.1287206562%40sss.pgh.pa.us



Re: partitioning an existing table - efficient pg_dump

2017-12-30 Thread Justin Pryzby
On Sat, Dec 30, 2017 at 09:19:05AM -0500, Robert Blayzor wrote:
> On Dec 30, 2017, at 12:38 AM, Justin Pryzby  wrote:

> > BTW depending on your requirements, it may be possible to make pg_dump much
> > more efficient.  For our data, it's reasonable to assume that a table is
> > "final" if its constraints exclude data older than a few days ago, and it 
> > can
> > be permanently dumped and excluded from future, daily backups, which makes 
> > the
> > backups smaller and faster, and probably causes less cache churn, etc.  But 
> > I
> > imagine you might have different requirements, so that may be infeasible, or
> > you'd maybe have to track insertions, either via p
> 
> The idea is to only keep a # of months available for searching over a 
> period of months. Those months could be 3 or more, up to a year, etc. But 
> being able to just drop an entire child table for pruning is very 
> attractive. Right now the average month's data is about 2-3 million rows. 
> Data is just inserted and then only searched. Never updated…
> 
> I also like the idea of skipping all this older data from a PGdump. We 
> archive records inserted into these tables daily into cold storage. ie: 
> export and compressed. So the data is saved cold. We dump the DB nightly 
> also, but probably would make sense to skip anything outside of the newest 
> child table. Just not sure how to make that happen, yet….

For us, I classify the tables as "partitioned" or "not partitioned" and
subdivide "partitioned" into "recent" or "historic" based on table names; but
if you design it from scratch then you'd have the opportunity to keep a list of
partitioned tables, their associated date range, date of most recent insertion,
and most recent "final" backup.

This is the essence of it:
snap= ... SELECT pg_export_snapshot();
pg_dump --snap "$snap" -T "$ptnreg" -f nonpartitioned.new
pg_dump --snap "$snap" -t "$recent" -f recent.new
loop around historic partitioned tables and run "final" pg_dump if it's been
  INSERTed more recently than it's been dumped.
remove any "final" pg_dump not included in any existing backup (assuming you
  keep multiple copies on different rotation).

Note that pg_dump -t/-T is different from "egrep" in a few special ways..

Justin



Re: HDD vs SSD without explanation

2018-01-14 Thread Justin Pryzby
On Sun, Jan 14, 2018 at 12:44:00PM -0800, Neto pr wrote:
> Dear all
> 
> Someone help me analyze the two execution plans below (Explain ANALYZE
> used), is the  query 9 of TPC-H benchmark [1].
>
> I'm using a server HP Intel Xeon 2.8GHz/4-core - Memory 8GB HDD SAS 320GB
> 15 Krpm AND SSD Samsung EVO 500GB.
>
> I think maybe the execution plan is using more write operations, and so the
> HDD SAS 15Krpm has been faster.

The query plan is all garbled by mail, could you resend?  Or post a link from
https://explain.depesz.com/

To see if the query is causing many writes (due to dirty pages, sorts, etc),
run with explain(analyze,buffers) 

But from what I could tell, your problems are here:

->  Parallel Seq Scan on lineitem  (cost=0.00..5861332.93 rows=15093 
width=41) (actual time=3.494..842667.110 rows=80004097 loops=3)
vs
->  Parallel Seq Scan on lineitem  (cost=0.00..5861333.40 rows=15140 
width=41) (actual time=41.805..224438.909 rows=80004097 loops=3)

->  Seq Scan on partsupp (cost=0.00..1052983.08 rows=31999708 width=22) (actual 
time=0.033..228828.149 rows=3200 loops=3)
vs
->  Seq Scan on partsupp  (cost=0.00..1052934.38 rows=31994838 width=22) 
(actual time=0.037..37865.003 rows=3200 loops=3)

Can you reproduce the speed difference using dd ?
time sudo dd if=/dev/sdX of=/dev/null bs=1M count=32K skip=$((128*$RANDOM/32)) 
# set bs to optimal_io_size 

Or: bonnie++ -f -n0

What OS/kernel are you using?  LVM?  filesystem?  I/O scheduler?  partitions?
readahead?  blockdev --getra

If you're running under linux, maybe you can just send the output of:
for a in /sys/block/sdX/queue/*; do echo "$a `cat $a`"; done
or: tail 
/sys/block/sdX/queue/{minimum_io_size,optimal_io_size,read_ahead_kb,scheduler,rotational,max_sectors_kb,logical_block_size,physical_block_size}

Justin



Re: HDD vs SSD without explanation

2018-01-14 Thread Justin Pryzby
On Sun, Jan 14, 2018 at 06:25:40PM -0800, Neto pr wrote:
> > The query plan is all garbled by mail , could you resend?  Or post a link 
> > from
> > https://explain.depesz.com/

On Sun, Jan 14, 2018 at 06:36:02PM -0800, Neto pr wrote:
> I was not able to upload to the site, because I'm saving the execution
> plan in the database, and when I retrieve it, it loses the line breaks,

That's why it's an issue for me, too..

> > What OS/kernel are you using?  LVM?  filesystem?  I/O scheduler?  
> > partitions?
> 
> See below the Disk FileSystem 
> root@hp2ml110deb:/# fdisk -l
> Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
> 
> Disk /dev/sdb: 465.8 GiB, 500107862016 bytes, 976773168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> 
What about sdb partitions/FS?

On Sun, Jan 14, 2018 at 06:25:40PM -0800, Neto pr wrote:
> The DBMS and tablespace of users is installed in /dev/sdb  SSD.

Is that also a temp_tablespace ?  Or are your hashes spilling to HDD instead ?

Group Key: nation.n_name, (date_part('year'::text, 
(orders.o_orderdate)::timestamp without time zone))
Buffers: shared hit=3773802 read=7120852, temp read=3550293 written=3541542

Is your SSD being used for anything else ?

What about these?

> > readahead?  blockdev --getra

> > If you're running under linux, maybe you can just send the output of:
> > for a in /sys/block/sdX/queue/*; do echo "$a `cat $a`"; done
> > or: tail 
> > /sys/block/sdX/queue/{minimum_io_size,optimal_io_size,read_ahead_kb,scheduler,rotational,max_sectors_kb,logical_block_size,physical_block_size}

> > Can you reproduce the speed difference using dd ?
> > time sudo dd if=/dev/sdX of=/dev/null bs=1M count=32K 
> > skip=$((128*$RANDOM/32)) # set bs to optimal_io_size
> >
> > Or: bonnie++ -f -n0

Justin



Re: HDD vs SSD without explanation

2018-01-15 Thread Justin Pryzby
On Mon, Jan 15, 2018 at 05:19:59PM -0800, Neto pr wrote:
> >> Can you reproduce the speed difference using dd ?
> >> time sudo dd if=/dev/sdX of=/dev/null bs=1M count=32K
> >> skip=$((128*$RANDOM/32)) # set bs to optimal_io_size
> >
> > Still I would have expected somewhat similar results in the outcome, so yes,
> > it is weird that the SAS drive doubles the SSD performance. That is why I
> > think there must be something else going on during your tests on the SSD
> > server. It can also be that the SSD isn't working properly or you are
> > running an suboptimal OS+server+controller configuration for the drive.
> 
> Ok.
> 
> Can you help me to analyze the output of the command: dd if=/dev/sdX
> of=/dev/null bs=1M count=32K skip=$((128*$RANDOM/32)) # set bs to
> optimal_io_size

You should run the "dd" without the DB or anything else using the drive.  That
gets peformance of the drive, without the DB.

You should probably rerun the "dd" command using /dev/sdb1 if there's a
partition table on top (??).

I'm still wondering about these:

On Sun, Jan 14, 2018 at 09:09:41PM -0600, Justin Pryzby wrote:
> What about sdb partitions/FS?

> > > readahead?  blockdev --getra
> 
> > > If you're running under linux, maybe you can just send the output of:
> > > for a in /sys/block/sdX/queue/*; do echo "$a `cat $a`"; done
> > > or: tail 
> > > /sys/block/sdX/queue/{minimum_io_size,optimal_io_size,read_ahead_kb,scheduler,rotational,max_sectors_kb,logical_block_size,physical_block_size}

Justin



Re: Bad plan

2018-01-23 Thread Justin Pryzby
On Tue, Jan 23, 2018 at 01:03:49PM +0100, Laurent Martelli wrote:
> Hello all,
> 
> So I have a view, for which I can select all rows in about 3s (returns ~80k
> rows), but if I add a where clause on a column, it takes +300s to return
> the ~8k lines.
> 
> From the plan, I see that it expects to return only 1 row and so choose to
> perform some nested loops. Of course, I did run "ANALYZE", but with no
> success.
> 
> I managed to speed things up with "set enable_nestloop = false;", but is
> that the only choice I have ? Should I report a bug ?


> Here is the default plan :

Can you resend without line breaks or paste a link to explain.depesz?

The problem appears to be here:

->  Nested Loop Left Join  (cost=32067.09..39197.85 rows=1 width=276) (actual 
time=342.725..340775.031 rows=7359 loops=1)
Join Filter: (sh.share_holder_partner = partner.id)
Rows Removed by Join Filter: 204915707

Justin



Re: Query Slow After 2018

2018-01-28 Thread Justin Pryzby
On Mon, Jan 29, 2018 at 12:32:59AM +0700, Nur Agus wrote:
> The following query run in just 9 ms:

>  "distrib_reports"."month" = 1 AND
>  "distrib_reports"."year" = 2017 AND
>  "distrib_reports"."state" = 'SUBMITTED' AND
>  "distrib_report_groups"."distrib_report_group_type_id" =
> '559a5fdc-418d-4494-aebf-80ecf8743d35'

> The explain analyze of the 2 queries are resulting on really different
> query plan, here are the links to depesz:
> 2017 --> explain result on postgres-9: https://explain.depesz.com/s/qJF1
> 2018 --> explain result on postgres-9: https://explain.depesz.com/s/pT0y

> The question is, why the query planner choose such very different path just
> by changing one parameter?

Looks like this badly underestimates its rowcount:

Index Scan using index_distrib_reports_on_year on distrib_reports 
(cost=0.42..40.62 rows=8 width=32) (actual time=0.034..50.452 rows=17,055 
loops=1)
Index Cond: (year = 2018)
Filter: ((month = 1) AND ((state)::text = 'SUBMITTED'::text))
Rows Removed by Filter: 1049

Maybe because "if year==2018" then, month=1 does essentialy nothing ..
..but postgres thinks it'll filters out some 90% of the rows.

And possibly the same for SUBMITTED (?)
You should probably use a timestamp column rather than integer year+month.

On PG10, you could probably work around it using "CREATE STATISTICS".
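
A sketch of that (PG10 syntax; the statistics name is made up, the columns are
taken from your query):

CREATE STATISTICS distrib_reports_year_month (dependencies)
    ON year, month FROM distrib_reports;
ANALYZE distrib_reports;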

> This behaviour is *not-reproducable* on postgres-10. On postgres-10, the
> query plan are consistent, and both have very acceptable time:
> 2017 --> explain result on postgres-10: https://explain.depesz.com/s/N9r5
> 2018 --> --> explain result on postgres-10:
> https://explain.depesz.com/s/Tf5K
..I think default max_parallel_workers_per_gather=3 by chance causes the plan
to be the same.

I think there's still an underestimated rowcount with PG10 (without CREATE
STATISTICS), but it's masked by "rows=1 actual rows=0" roundoff error with a
high loop count.

Justin



Re: postgresql 10.1 wrong plan in when using partitions bug

2018-02-04 Thread Justin Pryzby
On Sun, Feb 04, 2018 at 05:06:38PM +0200, Mariel Cherkassky wrote:
> Great, it solved the issue. Seems problematic that the planner do full
> scans on all partitions in the first case isnt it ? Seems like a bug ?

See also:
https://www.postgresql.org/message-id/20170725131650.GA30519%40telsasoft.com
https://www.postgresql.org/message-id/20170825154434.GC16287%40telsasoft.com

Justin

2018-02-04 16:54 GMT+02:00 Andreas Kretschmer :
> > On 04.02.2018 at 13:19, Mariel Cherkassky wrote:
> >> I checked the plan of the next query :
> >> explain select count(*) from log_full where end_date between
> >> to_date('2017/12/03','YY/MM/DD') and to_date('2017/12/03','YY/MM/DD');
> >
> > can you rewrite the query to
> >
> > ... where end_date between '2017/12/03' and '2017/12/03'
> >
> > maybe the planner should be smart enough to do that for you, but obviously
> > he can't. So it's a workaround, but it seems to solve the problem.



Re: postgresql 10.1 wrong plan in when using partitions bug

2018-02-05 Thread Justin Pryzby
On Sun, Feb 04, 2018 at 05:28:52PM +0200, Mariel Cherkassky wrote:
> I read those two links and I don't think that they are relevant because:
> 1) I didn't do any join.
> 2) I used a where clause in my select

https://www.postgresql.org/docs/current/static/ddl-partitioning.html
|The following caveats apply to constraint exclusion:
| Constraint exclusion only works when the query's WHERE clause contains
|constants (or externally supplied parameters). For example, a comparison
|against a non-immutable function such as CURRENT_TIMESTAMP cannot be optimized,
|since the planner cannot know which partition the function value might fall
|into at run time.
[..]

The issue is the comparison between a to_date() function call and a constant;
that doesn't allow constraint exclusion as currently implemented.

Justin

2018-02-04 16:54 GMT+02:00 Andreas Kretschmer :
> > On 04.02.2018 at 13:19, Mariel Cherkassky wrote:
> > > >> I checked the plan of the next query :
> > > >> explain select count(*) from log_full where end_date between
> > > >> to_date('2017/12/03','YY/MM/DD') and to_date('2017/12/03','YY/MM/DD');
> > > >
> > > > can you rewrite the query to
> > > >
> > > > ... where end_date between '2017/12/03' and '2017/12/03'
> > > >
> > > > maybe the planner should be smart enough to do that for you, but obviously
> > > > he can't. So it's a workaround, but it seems to solve the problem.



Re: failing to use index on UNION of matviews (Re: postgresql 10.1 wrong plan in when using partitions bug)

2018-02-06 Thread Justin Pryzby
On Sun, Feb 04, 2018 at 11:04:56AM -0500, Rick Otten wrote:
> On Sun, Feb 4, 2018 at 10:35 AM, Tom Lane  wrote:
> 
> > Rick Otten  writes:
> > > I'm wrestling with a very similar problem too - except instead of official
> > > partitions I have views on top of a bunch (50+) of unioned materialized
> > > views, each "partition" with 10M - 100M rows.  On 9.6.6 the queries would
> > > use the indexes on each materialized view.  On 10.1, every materialized
> > > view is sequence scanned.

I think it'd be useful to see the plan from explain analyze, on both the
"parent" view and a child, with and without SET enable_seqscan=off.

Justin



Re: effective_io_concurrency on EBS/gp2

2018-02-06 Thread Justin Pryzby
On Wed, Jan 31, 2018 at 04:34:18PM -0300, Claudio Freire wrote:
> In my experience playing with prefetch, e_i_c>0 interferes with kernel
> read-ahead. What you've got there would make sense if what postgres
> thinks will be random I/O ends up being sequential. With e_i_c=0, the
> kernel will optimize the hell out of it, because it's a predictable
> pattern. But with e_i_c=1, the kernel's optimization gets disabled but
> postgres isn't reading much ahead, so you get the worst possible case.

On Thu, Feb 01, 2018 at 03:39:07PM -0300, Claudio Freire wrote:
> Problem is, if you're scanning a highly correlated index, the
> mechanism is counterproductive. I had worked on some POC patches for
> correcting that, I guess I could work something out, but it's
> low-priority for me. Especially since it's actually a kernel "bug" (or
> shortcoming), that could be fixed in the kernel rather than worked
> around by postgres.

On Sun, Feb 04, 2018 at 11:27:25PM -0300, Claudio Freire wrote:
> ... Dense scans have large portions of contiguous fetches, a pattern that is
> quite adversely affected by the current prefetch mechanism in linux.
> 
> ... There's a rather simple workaround for this, pg should just avoid issuing
> prefetch orders for sequential block patterns, since those are already much
> better handled by the kernel itself.

Thinking out loud.. if prefetch were a separate process, I imagine this
wouldn't be an issue; is it possible the parallel worker code could take on
responsibility for prefetching (?)

Justin



Re: Slow query on partitioned table.

2018-03-27 Thread Justin Pryzby
On Tue, Mar 27, 2018 at 03:14:30PM +0100, Glenn Pierce wrote:
> Hi I am having terrible trouble with a simple partitioned table.
> Select queries are very slow.

> The child tables are all like
> Check constraints:
> "sensor_values_2018q1_timestamp_check" CHECK (ts >= '2018-01-01
> 00:00:00+00'::timestamp with time zone AND ts < '2018-04-01
> 01:00:00+01'::timestamp with time zone)

> EXPLAIN (ANALYZE, BUFFERS)  SELECT ts::timestamptz, s1.sensor_id, s1.value
>   FROM sensor_values s1
>WHERE s1.sensor_id =
> ANY(ARRAY[596304,597992,610978,597998])
>  AND s1.ts >= '2000-01-01
> 00:01:01'::timestamptz AND
>  s1.ts < '2018-03-20
> 00:01:01'::timestamptz

> Shows the following output
> https://explain.depesz.com/s/c8HU

It's scanning all partitions, so apparently constraint_exclusion isn't working.

Is it because the CHECK has ts with different timezones +00 and +01 ??

Also, it looks funny to use 00:01:01 as the beginning of the day (although I
think it's true that an HR department would understand that better..).

Justin



Re: Slow query on partitioned table.

2018-03-27 Thread Justin Pryzby
Re-added -performance.

On Tue, Mar 27, 2018 at 05:13:25PM +0100, Glenn Pierce wrote:
> Damn as I was playing with the indexes I must have deleted the constraints :(
> Question if I have a constraint like
> 
> ALTER TABLE sensor_values_2007q1
>   ADD CONSTRAINT sensor_values_2007q1_sensor_id_timestamp_constraint
> UNIQUE (sensor_id, ts);
> 
> will that be used like an index  or do I need to add a separate index ?

Yes:

https://www.postgresql.org/docs/current/static/ddl-constraints.html
|Adding a unique constraint will automatically create a unique B-tree index on
|the column or group of columns listed in the constraint

https://www.postgresql.org/docs/current/static/indexes-unique.html
|PostgreSQL automatically creates a unique index when a unique constraint or
|primary key is defined for a table. The index ... is the mechanism that
|enforces the constraint.
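So, for example, psql's \d on that partition should show it under "Indexes:" as
something like (a sketch):

    "sensor_values_2007q1_sensor_id_timestamp_constraint" UNIQUE CONSTRAINT, btree (sensor_id, ts)

and a separate index on (sensor_id, ts) would be redundant.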

Justin



Re: Performance issues while running select sql query

2018-04-28 Thread Justin Pryzby
On Sun, Apr 29, 2018 at 10:05:23AM +0530, Kaushal Shriyan wrote:
> Hi,
> 
> I am running postgresql db server 9.4.14 on AWS of C4.2xlarge instance type
> with 500 GB volume of volume type io1 with 25000 IOPS and I am seeing
> performance issues. The sql query response takes around *127713.413 ms *time
> *.* Is there a way to find out the bottleneck?

Send the output of "explain(analyze,buffers)" for the query?

Justin



Re: Performance issues while running select sql query

2018-04-29 Thread Justin Pryzby
On Sun, Apr 29, 2018 at 10:05:23AM +0530, Kaushal Shriyan wrote:
> # SELECT serorgid,appname,sum(message_count) AS mtrc0,date_trunc('day',
> client_received_start_timestamp at time zone '+5:30:0')::timestamp without
> time zone AS time_unit FROM analytics."test.prod.fact" WHERE
> client_received_start_timestamp >= '2018-3-28 18:30:0' AND
> client_received_start_timestamp < '2018-4-11 18:30:0' AND  ((apiproxy in
> ('test-service' )  ) and (exchangeinstance != '(not set)'  ) and (devemail
> != 't...@example.com'  ) and (devemail != 's...@example.com'  ) and
> (devemail != 'a...@example.com'  ) and (devemail != 'x...@example.com'  ) and
> (apistatus = 'Success'  ) and (apiaction not in
> ('LRN','finder','ManuallySelect' )  ) and (appname not in ('Mobile Connect
> Developer Portal (Int(', 'MinskHBM', 'LondonHBM', 'SeoulHBM', 'MumbaiHBM',
> 'NVirginiaHBM','SPauloHBM', 'Mobile Connect HeartBeat Monitor',
> 'PDMAOpenSDKTest1', 'PDMAOpenSDKTest2', 'PDMASDKTest', 'APIHealth',
> 'A1qaDemoApp','test', 'dublin o2o test tool', 'Test from John do not
> provision' )  ) and (serorgid = 'aircel'  ))  GROUP BY
> serorgid,appname,time_unit ORDER BY time_unit DESC LIMIT 14400 OFFSET 0;

This table has inheritence children.  Do they have constraints?  On what
column?  Is constraint_exclusion enabled and working for that?

It looks like test.prod.fact_624 is being read using index in under 1sec, and
the rest using seq scan, taking 5-10sec.

So what are the table+index definitions of the parent and children (say fact_624
and 631)?

Have the child tables been ANALYZEd recently?
Also, have you manually ANALYZEd the parent table?
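Something like this sketch (adjust the schema-qualified names) shows when they
were last analyzed, and analyzes the parent explicitly:

SELECT relname, last_analyze, last_autoanalyze
  FROM pg_stat_all_tables WHERE relname LIKE 'test.prod.fact%';
ANALYZE analytics."test.prod.fact";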

On Sun, Apr 29, 2018 at 10:48:48AM +0530, Kaushal Shriyan wrote:
> >  QUERY PLAN
> >  Limit  (cost=2568002.26..2568038.26 rows=14400 width=35) (actual 
> > time=127357.296..127357.543 rows=231 loops=1)
> >Buffers: shared hit=28019 read=1954681
...

> >->  Index Scan using 
> > "test.prod.fact_624_client_received_start_timestamp_idx" on 
> > "test.prod.fact_624"  (cost=0.42..10948.27 rows=1002 width=34) (actual 
> > time=3.034..278.320 rows=1231 loops=1)
> >  Index Cond: 
> > ((client_received_start_timestamp >= '2018-03-28 18:30:00'::timestamp 
> > without time zone) AND (client_received_start_timestamp < '2018-04-11 
> > 18:30:00'::timestamp without time zone))
> >  Rows Removed by Filter: 42629
> >  Buffers: shared hit=27966 read=498
> >->  Seq Scan on "test.prod.fact_631" 
> > (cost=0.00..171447.63 rows=16464 width=34) (actual time=0.070..7565.812 
> > rows=20609 loops=1)
> >  Rows Removed by Filter: 645406
> >  Buffers: shared hit=2 read=132279
...




Re: Time bucketing query performance

2018-05-07 Thread Justin Pryzby
On Mon, May 07, 2018 at 07:33:17PM -0400, Julian Wilson wrote:
> Hi,
> 
> I'm trying to debug improve the performance of my time bucketing query.
> What I'm trying to do is essentially bucket by an arbitrary interval and
> then do some aggregations within that interval (min,max,sum, etc). I am
> using a `max` in the query I posted. For context in the data, it is 1
> minute candles of cryptocurrency data (open price, high price, low price,
> close price, volume, for an interval). I want to transform this to a 10
> minute interval, on demand, and that is what this query is meant to do.
> 
> I understand the slow part of my query is in the LEFT JOIN, but I just
> can't quite figure out how to do it without the LEFT JOIN.
> 
> Here is my pastebin with all the details so I don't clutter the message. I
> tried to follow everything in the 'Slow Query Questions' WIKI page. There
> is also a depesz link there.
> 
> https://ybin.me/p/9d3f52d88b4b2a46#kYLotYpNuIjjbp2P4l3la8fGSJIV0p+opH4sPq1m2/Y=

These may not be a substantial part of the issue, but I have some suggestions:

0) You're using a CTE, which cannot have stats (unlike a temporary table).  Can you
rewrite without it, perhaps with GROUP BY date_trunc('hour', time_open) ?  (A
sketch follows after this list.)

1) you're querying on start_time AND end_time, and I believe the planner thinks
those conditions are independent, but they're not.  Try getting rid of the
frame_end and move the "5 months" into the main query using BETWEEN or two
ANDed conditions on the same variable.  See if the rowcount estimate is more
accurate:

->  Index Scan using historical_ohlcv_pkey on historical_ohlcv ohlcv  
(cost=0.56..2488.58 ROWS=12110 width=22) (actual time=3.709..4.403 ROWS=9 
loops=3625)
  Index Cond: ((exchange_symbol = 'BINANCE'::text) AND (symbol_id = 
'ETHBTC'::text) AND (time_open >= g.start_time))
  Filter: (time_close < g.end_time)

Alternately, you could try:
CREATE STATISTICS historical_ohlcv_dep (dependencies) ON time_open, time_close
  FROM historical_ohlcv;
ANALYZE historical_ohlcv;

2) Is your work_mem really default?  64kb?  Recommend changing it to see if the
plan changes (although it looks like that's not the issue).

3) If you have SSD, you should probably CREATE TABLESPACE tmp LOCATION
'/srv/pgsql_tmp', then ALTER SYSTEM SET temp_tablespaces='tmp' and
SELECT pg_reload_conf();

4) If those don't help, as a test, try running with SET enable_nestloop=off.
I'm guessing that fixing rowcount estimate in (1) will be sufficient.

5) May not be important, but rerun with explain (ANALYZE,BUFFERS) and show the
results.
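For (0), a minimal sketch of the GROUP BY form - the bucket width, aggregates
and column names (price_high/price_low) are guesses, since I only have the plan
to go by:

SELECT date_trunc('hour', time_open) AS bucket,
       max(price_high) AS high,
       min(price_low) AS low
  FROM historical_ohlcv
 WHERE exchange_symbol = 'BINANCE' AND symbol_id = 'ETHBTC'
   AND time_open >= now() - interval '5 months'
 GROUP BY 1 ORDER BY 1;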

Justin



Re: Help with tuning slow query

2018-05-19 Thread Justin Pryzby
On Fri, May 18, 2018 at 08:32:55PM +, Kotapati, Anil wrote:
> We are facing issues with one of our query, when we use order by count it is 
> taking lot of time to execute the query. To be precise it is taking 9 min to 
> execute the query from table which has ~220 million records. Is there a way 
> to make this query run faster and efficiently using order by count. Below is 
> the query which I’m trying to run
> 
> Select account_number, sum(count_of_event) as "error_count"
> FROM event_daily_summary
> group by account_number,event_date,process_name
> having event_date >= '2018-05-07'
> and process_name='exp90d_xreerror'
> order by sum(count_of_event) desc
> limit 5000

Would you provide the information listed here ?  Table definition, query plan, 
etc
https://wiki.postgresql.org/wiki/Slow_Query_Questions

Also, why "HAVING" ? Shouldn't you use WHERE ?

Does the real query have conditions on event_date and process name or is that
just for testing purposes?

Justin



Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!

2018-05-22 Thread Justin Pryzby
On Tue, May 22, 2018 at 03:32:59AM -0700, pavan95 wrote:
> Sample query plan: 
> Limit  (cost=92129.35..92129.63 rows=10 width=248)

Would you send the output of explain(analyze,buffers) for the whole query ?
And/or paste it into explain.depesz site and send a link.

Justin



Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!

2018-05-22 Thread Justin Pryzby
On Tue, May 22, 2018 at 03:51:44AM -0700, pavan95 wrote:
> Please find the output of explain(analyze,buffers) for the whole query in
> the below link.

> Seq Scan on res_users users (cost=750.92..1,836.69 rows=249 width=15) (actual 
> time=3.962..17.544 rows=67 loops=1) 

Not sure but would you try creating an index on:
res_users.res_employee_id
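For example (a sketch; CONCURRENTLY avoids blocking writes while it builds):

CREATE INDEX CONCURRENTLY ON res_users (res_employee_id);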

> Seq Scan on res_users user1 (cost=0.00..58.03 rows=1,303 width=15) (actual 
> time=0.002..0.002 rows=1 loops=1)

Also, the planner's estimate for the res_users table is off by ~1300x, so you
should probably VACUUM ANALYZE it and then recheck.  I don't think we know what
postgres version you have, but last week's patch releases include a fix which
may be relevant (reltuples including dead tuples).

Also I don't know the definition of this table or its indices:
tms_workflow_history

..but it looks like an additional or modified index or maybe clustering the
table on existing index might help (active? is_final_approver?)
Or maybe this should be 3 separate indices rather than composite index?
Perhaps some of those could be BRIN indices, depending on postgres version

Justin



Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!

2018-05-23 Thread Justin Pryzby
On Wed, May 23, 2018 at 12:01:06AM -0700, pavan95 wrote:
> As said, created index on the res_users.res_employee_id and the below link
> is the explain plan result.
> 
> Link:  https://explain.depesz.com/s/hoct
> 
> And the cost of Previous query is 92,129 and the cost of current modified
> query after creating the above said index is 91,462. But good thing is we

Forget the cost - that's postgres' *model* of the combined IO+CPU cost.
If the model is off, that may cause bad plans and could be looked into
further.

In any case, that index cut your runtime from 75sec to 60sec (in spite of the
modelled cost).

It looks like you resolved the bad estimate on the users table?

> 2. tms_workflow_history:
> Indexes:
> "tms_workflow_history_pkey" PRIMARY KEY, btree (id)
> "curract_state_isfinal_app_idx" btree (current_activity, state, 
> is_final_approver)
> "timesheet_id_group_id_active_idx" btree (timesheet_id, group_id, active)
> "tms_wkf_his_active_is_final_approveridx" btree (active, 
> is_final_approver)
> "tms_wkf_his_group_id_idx" btree (group_id)
> "tms_wkf_his_timesheet_id_idx" btree (timesheet_id)
> "tms_wkf_hist_current_activity_idx" btree (current_activity)
> "tms_wkf_hist_state_idx" btree (state)
> "wsr_id_idx" btree (wsr_id)

How big is the table ?  And curract_state_isfinal_app_idx ?
Have these been reindexed (or pg_repacked) recently?

It seems to me that the remaining query optimization is to improve this:
> Bitmap Heap Scan on tms_workflow_history history (cost=193.19..1,090.50 
> rows=6,041 width=12) (actual time=3.692..15.714 rows=11,351 loops=1)

I think you could consider clustering (or repacking) the table on
curract_state_isfinal_app_idx (but you'll have to judge if that's okay and
won't negatively affect other queries).
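For example (a sketch; note that CLUSTER takes an exclusive lock for the
duration, pg_repack does not):

CLUSTER tms_workflow_history USING curract_state_isfinal_app_idx;
ANALYZE tms_workflow_history;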

But, what's your target runtime ?  Improvements here could cut at most 15sec
off the total 60sec.  If you're hoping to save more than that, you'll need to
(also) look further than the query:

 - postgres parameters: what are shared_buffers, work_mem, effective_cache_size 
?
   + https://wiki.postgresql.org/wiki/Server_Configuration
 - are there other DBs/applications running on the server/VM ?
 - kernel tuning (readahead, VM parameters, write cache, scheduler, THP, etc)
 - server hardware (what OS? storage? RAM? filesystem?)
 - how does the storage perform outside of postgres?
   + something like this: /usr/sbin/bonnie++ -f -n0 -x4 -d /var/lib/pgsql

Justin



Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!

2018-05-23 Thread Justin Pryzby
On Wed, May 23, 2018 at 07:03:18AM -0700, pavan95 wrote:
> Please find the explain plan which got increased again vastly. Is this
> because of the increase in rows?
> 
> Link :  https://explain.depesz.com/s/Ifr   

That's explain without "analyze", so not very useful.

There's a handful of questions:

On Wed, May 23, 2018 at 08:43:22AM -0500, Justin Pryzby wrote:
>  - postgres parameters: what are shared_buffers, work_mem, 
> effective_cache_size ?
>+ https://wiki.postgresql.org/wiki/Server_Configuration
>  - are there other DBs/applications running on the server/VM ?
>  - kernel tuning (readahead, VM parameters, write cache, scheduler, THP, etc)
>  - server hardware (what OS? storage? RAM? filesystem?)
>  - how does the storage perform outside of postgres?
>+ something like this: /usr/sbin/bonnie++ -f -n0 -x4 -d /var/lib/pgsql

Justin



Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!

2018-05-23 Thread Justin Pryzby
On Wed, May 23, 2018 at 10:20:42PM -0700, pavan95 wrote:
> Hi Justin,
> 
> Please find the below explain plan link.
> 
> Link:  https://explain.depesz.com/s/owE   

That's explain analyze but explain(analyze,buffers) is better.

Is this on a completely different server than the previous plans ?

This rowcount misestimate appears to be a significant part of the problem:
 Merge Join (cost=228.77..992.11 ROWS=20 width=22) (actual time=4.353..12.439 
ROWS=343 loops=1)
Merge Cond: (history_2.timesheet_id = header_2.id)

You could look at the available stats for that table's column in pg_stats.
Is there a "most common values" list?
Maybe you need to ALTER TABLE .. ALTER COLUMN .. SET STATISTICS 999 (or some
increased value) and re-analyze ?
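For example (a sketch; the table and column names are guesses based on the
merge condition above):

SELECT n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'tms_workflow_history' AND attname = 'timesheet_id';
ALTER TABLE tms_workflow_history ALTER COLUMN timesheet_id SET STATISTICS 999;
ANALYZE tms_workflow_history;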

You can see these are also taking large component of the query time:

 Bitmap Index Scan on ts_detail_release_no_idx (cost=0.00..33.86 rows=1,259 
width=0) (actual time=0.304..0.304 rows=1,331 LOOPS=327)
Index Cond: ((release_no)::text = 'paid_time_off'::text)
...
 Bitmap Index Scan on ts_detail_release_no_idx (cost=0.00..33.86 rows=1,259 
width=0) (actual time=0.304..0.304 rows=1,331 LOOPS=343)
Index Cond: ((release_no)::text = 'paid_time_off'::text)

I wonder whether it would help to
CREATE INDEX ON tms_timesheet_details(timesheet_header_id) WHERE
((release_no)::text = 'paid_time_off'::text);

In addition to the other settings I asked about, it might be interesting to
SHOW effective_io_concurrency;

You're at the point where I can't reasonably contribute much more.

Justin



propose web form for submission of performance problems

2018-05-24 Thread Justin Pryzby
What would the list think of a web form for submitting problems to the
performance list, similar to the pgsql-bugs form?

Alternately, or perhaps additionally, a script (hopefully bundled with
postgres) which collects at least the non-query-specific info and probably
creates a logfile for attachment.

I assume fields would be mostly the content/questions off the SlowQuery wiki
page, plus everything else asked with any frequency.

There could also be "required" fields..

Justin



Re: propose web form for submission of performance problems

2018-05-24 Thread Justin Pryzby
On Thu, May 24, 2018 at 06:27:31PM -0700, Craig James wrote:
> On Thu, May 24, 2018 at 4:57 PM, Justin Pryzby  wrote:
> 
> > What would the list think of a web form for submitting problems the
> > performance
> > list, similar to the pgsql-bugs form?
> >
> > Alternately, or perhaps additionally, a script (hopefully bundled with
> > postgres) which collects at least the non-query specific info and probably
> > creates .logfile file for attachment.
> >
> > I assume fields would be mostly the content/questions off the SlowQuery
> > wiki
> > page, plus everything else asked with any frequency.
> >
> > There could also be "required" fields..
> >
> 
> Is there something wrong with this email group? I actually to like it. Low

I meant something exactly like the bug form:
https://www.postgresql.org/account/login/?next=/account/submitbug/
https://www.postgresql.org/list/pgsql-bugs/2018-05/
"Entering a bug report this way causes it to be mailed to the
 mailing list."

The goal was to continue the list but encourage problem reports to include all
of the most commonly-needed(missing) information, and in a standard format too.

But actually I see that nowadays it requires an account, which isn't great and
might discourage many people from using the web form.  But maybe that's better
for some people than requiring subscription to the list.

Justin



Re: Possible optimisation: push down SORT and LIMIT nodes

2018-05-30 Thread Justin Pryzby
On Wed, May 30, 2018 at 03:46:40PM +, Christopher Wilson wrote:
> We have a query which is rather slow (about 10 seconds), and it looks like 
> this:
> 
> The inventory table has the quantity of each asset in the inventory on each
> date (complete SQL to create and populate the tables with dummy data is
> below). The query plan looks like this (the non-parallel version is similar):

Hi,

Thanks for including the test case.

> Limit  (cost=217591.77..217603.60 rows=100 width=32) (actual 
> time=9122.235..9122.535 rows=100 loops=1)
...
>  ->  Sort  (cost=216591.73..220628.83 rows=1614839 width=32) (actual 
> time=8879.909..8880.030 rows=727 loops=4)
>Sort Key: inventory.date, asset.name
>Sort Method: external merge  Disk: 50904kB
>Buffers: shared hit=27365, temp read=25943 written=25947

Yep, the sort is expensive and largely wasted..

> I'm imagining something like a sort-limit-finish node, which sorts its input
> and then returns at least the limit number of rows, but keeps returning rows
> until it exhausts the last sort prefix that it read.
[...]
> Does this sound correct, reasonable and potentially interesting to Postgres
> developers?

I think your analysis may be unnecessarily specific to your particular problem
query.

For diagnostic purposes, I was able to vastly improve the query runtime with
a CTE (WITH):

|postgres=# explain(analyze,buffers) WITH x AS (SELECT inventory.date, 
asset.name, inventory.quantity FROM temp.inventory LEFT JOIN temp.asset ON 
asset.id=id_asset LIMIT 99) SELECT * FROM x ORDER BY date, name;
| Sort  (cost=1090.60..1090.85 rows=99 width=40) (actual time=3.764..3.988 
rows=99 loops=1)
|   Sort Key: x.date, x.name
|   Sort Method: quicksort  Memory: 32kB
|   Buffers: shared hit=298
|   CTE x
| ->  Limit  (cost=0.28..889.32 rows=99 width=31) (actual time=0.063..2.385 
rows=99 loops=1)
|   Buffers: shared hit=298
|   ->  Nested Loop Left Join  (cost=0.28..44955006.99 rows=5006001 
width=31) (actual time=0.058..1.940 rows=99 loops=1)
| Buffers: shared hit=298
| ->  Seq Scan on inventory  (cost=0.00..5033061.00 
rows=5006001 width=12) (actual time=0.020..0.275 rows=99 loops=1)
|   Buffers: shared hit=1
| ->  Index Scan using asset_pkey on asset  (cost=0.28..7.98 
rows=1 width=27) (actual time=0.008..0.008 rows=1 loops=99)
|   Index Cond: (id = inventory.id_asset)
|   Buffers: shared hit=297
|   ->  CTE Scan on x  (cost=0.00..198.00 rows=99 width=40) (actual 
time=0.073..2.989 rows=99 loops=1)
| Buffers: shared hit=298
| Planning time: 0.327 ms
| Execution time: 4.260 ms

It's not clear to me if there's some reason why the planner couldn't know to
use a similar plan (sort-limit-... rather than limit-sort-...)

Justin



Re: Possible optimisation: push down SORT and LIMIT nodes

2018-05-30 Thread Justin Pryzby
On Wed, May 30, 2018 at 02:02:31PM -0500, Justin Pryzby wrote:
> For diagnostic purposes, I was able to vastly improve the query runtime with
> a CTE (WITH):

I realized this was broken as soon as I sent it (for the essential reason of
discarding rows before having sorted them).  Sorry for the noise.

Justin



Re: An I/O error occurred while sending to the backend (PG 13.4)

2022-03-01 Thread Justin Pryzby
On Tue, Mar 01, 2022 at 04:28:31PM +, l...@laurent-hasson.com wrote:
> Now, there is an additional component I think... Storage is on an array and I 
> am not getting a clear answer as to where it is 😊 Is it possible that 
> something is happening at the storage layer? Could that be reported as a 
> network issue vs a storage issue for Postgres?

No.  If there were an error with storage, it'd be reported as a local error,
and the query would fail, rather than failing with a client-server
communication error.

> Also, both machines are actually VMs. I forgot to mention that and not sure 
> if that's relevant.

Are they running on the same hypervisor ?  Is that hyperv ?
Lacking other good hypotheses, that does seem relevant.

-- 
Justin




Re: OOM killer while pg_restore

2022-03-03 Thread Justin Pryzby
On Thu, Mar 03, 2022 at 09:59:03AM +0100, Marc Rechté wrote:
> Hello,
> 
> We have a pg_restore which fails due to RAM over-consumption of the
> corresponding PG backend, which ends-up with OOM killer.
> 
> The table has one PK, one index, and 3 FK constraints, active while restoring.

Send the schema for the table, index, and constraints (\d in psql).

What are the server settings ?
https://wiki.postgresql.org/wiki/Server_Configuration

What OS/version ?

> The dump contains over 200M rows for that table and is in custom format,
> which corresponds to 37 GB of total relation size in the original DB.
> 
> While importing, one can see the RSS + swap increasing linearly for the
> backend (executing the COPY)
> 
> On my machine (quite old PC), it failed after 16 hours, while the disk usage
> was reaching 26 GB and memory usage was 9.1g (RSS+swap)




Re: An I/O error occurred while sending to the backend (PG 13.4)

2022-03-03 Thread Justin Pryzby
On Thu, Mar 03, 2022 at 01:33:08PM -0300, Ranier Vilela wrote:
> Sorry, but this is much more on the client side.

The client is reporting the problem, as is the server.

> Following the logs, it is understood that the client is dropping the
> connection.

The logs show that the client's connection *was* dropped.
And on the server, the same.

> So most likely the error could be from Pentaho or JDBC.
> 
> https://www.geeksforgeeks.org/java-net-socketexception-in-java-with-examples/
> " This *SocketException* occurs on the server-side when the client closed
> the socket connection before the response could be returned over the
> socket."
> 
> I suggest moving this thread to the Pentaho or JDBC support.

We don't know the source of the problem.  I still doubt it's in postgres, but I
don't think it's helpful to blame the client, just because the client reported
the problem.  If the server were to disconnect abruptly, I'd expect the client
to report that, too.

Laurent would just have to start the conversation over (and probably collect
the same diagnostic information anyway).  The client projects could blame
postgres with as much rationale as there is for us to blame the client.

Please don't add confusion here.  I made suggestions for how to collect more
information to better understand the source of the problem, and there's
probably not much else to say without that.

-- 
Justin




Re: Optimal configuration for server

2022-03-07 Thread Justin Pryzby
On Mon, Mar 07, 2022 at 08:51:24AM -0300, Luiz Felipph wrote:
> My current problem:
> 
> under heavyload, i'm getting "connection closed" on the application
> level(java-jdbc, jboss ds)

Could you check whether the server is crashing ?

If you run "ps -fu postgres", you can compare the start time ("STIME") of the
postmaster parent process with that of the persistent, auxiliary, child
processes like the checkpointer.  If there was a crash, the checkpointer will
have restarted more recently than the parent process.

The SQL version of that is like:
SELECT date_trunc('second', pg_postmaster_start_time() - backend_start) FROM 
pg_stat_activity ORDER BY 1 DESC LIMIT 1;

-- 
Justin




Re: Using system tables directly takes many hours, using temp tables with no indexes takes a few seconds for geometry_columns view.

2022-03-23 Thread Justin Pryzby
On Wed, Mar 23, 2022 at 09:44:09AM +, Lars Aksel Opsahl wrote:
> Why is temp tables with no indexes much faster system tables with indexes ?

I think the "temp table" way is accidentally faster due to having no
statistics, not because it has no indexes.  If you run ANALYZE, you may hit the
same issue (or, maybe you just need to VACUUM ANALYZE your system catalogs).

-- 
Justin




Re: High process memory consumption when running sort

2022-03-23 Thread Justin Pryzby
On Wed, Mar 23, 2022 at 02:42:06PM +, Shai Shapira wrote:
> Hi,
> 
> When running our application, we noticed that some processes are taking a lot 
> of memory ( 10, 15, 20GB or so, of RSS ).
> It is also reproduced when running in psql.

Note that RSS can include shared_buffers read by that backend.
That's a linux behavior, not specific to postgres.  It's what Andres was
describing here:
https://www.postgresql.org/message-id/flat/20201003230149.mtd7fjsjwgii3...@alap3.anarazel.de

You have effective_cache_size = 48GB, so this seems to be working as intended.
(effective_cache_size is expected to include data cached not only by postgres
but by the OS page cache, too).

> Memory consumption: ( of case 2, application table, using system_stats )

I'm not sure, but I guess this is just a postgres view of whatever the OS
shows.

> Using top:
>   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
> 15298 postgres  20   0   16.8g   1.1g   1.1g S   0.0  1.7   0:02.63 postgres

> PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 
> (Red Hat 4.8.5-44), 64-bit
> Linux illin7504 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 11 19:12:04 EDT 
> 2020 x86_64 x86_64 x86_64 GNU/Linux

> shared_buffers  | configuration file   | postmaster   
>  | 2097152| 8kB  | 1024
> effective_cache_size| configuration file   | user 
>  | 6291456| 8kB  | 524288
> work_mem| configuration file   | user 
>  | 20480  | kB   | 4096




Re: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join

2022-03-24 Thread Justin Pryzby
On Tue, Mar 22, 2022 at 12:57:10PM +, Prajna Shetty wrote:
> 1.  We have performed Vacuum/Analyze/Reindex post Upgrade.
> 2.  Tweaked work_mem so it does not spill to Disk. We can Disk Usage But 
> it is still using Hash Aggregate and came down from 5 minutes to 20 seconds. 
> (Expected ~5 seconds). Attached plan after modifying work_mem
> 3.  Disabled Seqcan/ nestedloop
> 4.  Tweaked random_page_cost/seq_page_cost
> 5.  Set default_statistics_target=1000 and then run 
> vacuum(analyze,verbose) on selected tables.
> 6.  We have also tested performance by increasing resources up to 4 vCPU 
> and 32 GB RAM.

Would you provide your current settings ?
https://wiki.postgresql.org/wiki/Server_Configuration




Re: HIGH IO and Less CPU utilization

2022-03-29 Thread Justin Pryzby
Hi,

Thanks for providing all this info.

On Tue, Mar 29, 2022 at 11:34:18PM +0530, Rambabu g wrote:
> Hi All,
> 
> We have an issue with high load and IO Wait's but less cpu on postgres
> Database, The emp Table size is around 500GB, and the connections are very
> less.

What indexes are defined on this table ?
How large are they ?

> Red Hat Enterprise Linux Server release 7.9 (Maipo)
>  PostgreSQL 11.15 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 
> 20150623 (Red Hat 4.8.5-44), 64-bit
> 
> shared_buffers=12GB
> work_mem=128MB

> 14428 | 04:45:59.712892 | active  | INSERT INTO target (empno, name)
> SELECT empno, '' AS name FROM (select distinct  empno  from emp where 
> sname='test'
> and tp='EMP NAME 1' LIMIT 10) AS query   ;

Is this the only problem query, or just one example?
Are your issues with loading data, querying data or both ?

> -bash-4.2$ iostat -x

It shows that you only have a few filesystems in use.
It's common to have WAL and temp_tablespaces on a separate FS.
That probably wouldn't help your performance at all, but it would help to tell
what's doing I/O.  Is there anything else running on the VM besides postgres ?

You can also check:
SELECT COUNT(1), wait_event FROM pg_stat_activity GROUP BY 2 ORDER BY 1 DESC;

And the pg_buffercache extension:
SELECT COUNT(nullif(isdirty,'f')) dirty, COUNT(1) AS "all",
  COALESCE(c.relname, b.relfilenode::text)
FROM pg_buffercache b
LEFT JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY 3 ORDER BY 1 DESC, 2 DESC LIMIT 9;

> Hypervisor vendor: KVM

Are KSM or THP enabled on the hypervisor ?

tail /sys/kernel/mm/ksm/run \
     /sys/kernel/mm/transparent_hugepage/khugepaged/defrag \
     /sys/kernel/mm/transparent_hugepage/enabled \
     /sys/kernel/mm/transparent_hugepage/defrag

-- 
Justin




Re: HIGH IO and Less CPU utilization

2022-03-29 Thread Justin Pryzby
On Wed, Mar 30, 2022 at 12:52:05AM +0530, Rambabu g wrote:
> > What indexes are defined on this table ?
> > How large are they ?
>
> There are three indexes defined on the table, each one is around 20 to 25GB
> and the indexes is create on

Did you mean to say something else after "on" ?

Show the definition of the indexes from psql \d

> postgres=# explain select distinct  empno  from emp where sname='test' and 
> tp='EMP NAME 1'

Is this the only query that's performing poorly ?
You should send explain (analyze,buffers) for the problematic queries.

> > > Hypervisor vendor: KVM
> >
> > Are KSM or THP enabled on the hypervisor ?

> No, the Ec2 VM is delicate to postgres DB instances only.

Oh, so this is an EC2 and you cannot change the hypervisor itself.

> -bash-4.2$ tail /sys/kernel/mm/ksm/run 
> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag 
> /sys/kernel/mm/transparent_hugepage/enabled 
> /sys/kernel/mm/transparent_hugepage/defrag
...
> ==> /sys/kernel/mm/transparent_hugepage/defrag <==
> [always] madvise never

I doubt it will help, but you could try disabling these.
It's a quick experiment anyway.




Re: HIGH IO and Less CPU utilization

2022-03-30 Thread Justin Pryzby
On Wed, Mar 30, 2022 at 10:17:38AM +0530, Rambabu g wrote:
> Hi Justin,
> 
> Only one query is causing the issue, sharing the def of indexes. Please
> have a look.
> 
> > > There are three indexes defined on the table, each one is around 20 to 
> > > 25GB
> 
>  tp| character varying(2000)   | yes| tp   | extended |
> 
>852 | 00:09:56.131136 | IO  | DataFileRead  | explain
> analyze select distinct  empno  from emp where sname='test' and tp='EMP
> NAME 1'

The server is doing a scan of the large table.
The tp index matches a lot of rows (13e6) which probably aren't clustered, so
it elects to scan the 500GB table each time.

Looking at this in isolation, maybe it'd be enough to create an index on
tp,empno (and maybe drop the tp index).  CREATE INDEX CONCURRENTLY if you don't
want to disrupt other queries.
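For example (a sketch; the name of the existing single-column index is
hypothetical):

CREATE INDEX CONCURRENTLY ON emp (tp, empno);
-- then, once nothing else needs the old index:
DROP INDEX CONCURRENTLY emp_tp_idx;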

But this seems like something that should be solved in a better way, though;
like keeping a table with all the necessary "empno" values, maintained with
"INSERT ... ON CONFLICT DO NOTHING".  Or a trigger.




Re: An I/O error occurred while sending to the backend (PG 13.4)

2022-04-13 Thread Justin Pryzby
On Wed, Apr 13, 2022 at 03:36:19PM +, l...@laurent-hasson.com wrote:
> After a lot of back and forth, someone in IT informed us that the database VM 
> is under a backup schedule using Veeam. Apparently, during the backup window, 
> Veeam creates a snapshot and that takes the VM offline for a couple of 
> minutes… And of course, they scheduled this right at the busiest time of the 
> day for this machine which is during our nightly ETL. Their backup doesn’t 
> perform very week either, which explained why the failure seemed to randomly 
> happen at various points during our ETL (which takes about 2h30mn).
> 
> They moved the schedule out and the issue has not happened again over the 
> past 3 weeks. This looks like it was the root cause and would explain (I 
> think) how the database and the client simultaneously reported a connection 
> timeout.
> 
> Thank you so much for all your help in trying to figure this out and 
> exonerate Postgres.

Great, thanks for letting us know.
This time it wasn't postgres' fault; you're 2 for 3 ;)

One issue I've seen is if a vmware snapshot is taken and then saved for a long
time.  It can be okay if Veeam takes a transient snapshot, copies its data, and
then destroys the snapshot.  But it can be bad if multiple snapshots are taken
and then left around for a long time to use as a backup themselves.

-- 
Justin




Re: Query Tunning related to function

2022-04-16 Thread Justin Pryzby
On Thu, Apr 14, 2022 at 06:03:33AM +, Kumar, Mukesh wrote:
> We are running the below query in PostgreSQL and its taking approx. 8 to 9 
> sec to run the query.
> 
> Query - 1 ...
> 
> The explain plan and other details are placed at below link for more 
> information. We have checked the indexes on column but in the explain plan it 
> is showing as Seq Scan which we have to find out.
> 
> https://explain.depesz.com/s/Jsiw#stats

There's a list of information to provide in the postgres wiki, and people here
sent you a link to that wiki page on Feb 27, Mar 1, and Apr 12.  Your problem
report is still missing a lot of that information.  Asking for it piece by
piece would be tedious.

-- 
Justin




Re: significant jump in sql statement timing for on server vs a remote connection

2022-04-19 Thread Justin Pryzby
On Tue, Apr 19, 2022 at 03:00:09PM -0600, Sbob wrote:
> We are debugging a sql performance issue. We have a sql file with 50,000
> simple select statements in it. If I run the file locally it completes in
> less than 15sec.  If I force the local connection to be a tcp/ip connection
> via psql -h and I get approximately the same results, 15 - 16sec.
> 
> 
> However if we move the file to another server in the same network and run
> with a psql -h then it runs for more than 10min. Are there any postgres
> specific issues / settings / connection overhead  we should look at? Or is
> this simply a network issue and fully outside the scope of the postgres
> database?

What OS ?  What kind of authentication are you using ?
Is there a connection pooler involved ?  Did you try like that ?

Did you test how long it takes to run 10k empty statements locally vs remotely ?
time yes 'SELECT;' |head - |psql ... >/dev/null

-- 
Justin




Re: Query Planner not taking advantage of HASH PARTITION

2022-04-21 Thread Justin Pryzby
On Wed, Apr 20, 2022 at 07:11:37PM -0700, Benjamin Tingle wrote:
> @ the first point about write locks
> I think I had/have a misconception about how inserts work in postgres. It's
> my understanding that postgres will never draft a parallel insert plan for
> any query (except maybe CREATE TABLE AS?)

It's correct that DML (INSERT/UPDATE/DELETE) currently is not run in parallel.
https://www.postgresql.org/docs/current/when-can-parallel-query-be-used.html

> because the process needs to acquire an exclusive access write lock to the
> table it is inserting on.

But this is incorrect - DML acquires a relation lock, but not a strong one.
Multiple processes can insert into a table at once (because the row-excl lock
level is not self-conflicting, to be technical).
https://www.postgresql.org/docs/current/explicit-locking.html

In fact, that's a design requirement.  It's understood that many people would
be unhappy if only one client were able to run UPDATEs at once, and that only a
toy system would acquire a strong lock for DML.

-- 
Justin




Re: DB connection issue suggestions

2022-05-10 Thread Justin Pryzby
On Wed, May 11, 2022 at 12:59:01AM +0800, Sudhir Guna wrote:
>  Dear All,
> 
> We have recently upgraded Postgresql 9.4 standalone server to Postgresql
> 11.2 with High Availability (2 servers : Master and Standby).
> 
> While trying to test using ETL applications and reports, we observe that
> the ETL jobs fails with below error,
> 
> 2022/05/06 16:27:36 - Error occurred while trying to connect to the database
> 2022/05/06 16:27:36 - Error connecting to database: (using class
> org.postgresql.Driver)
> 2022/05/06 16:27:36 - FATAL: Sorry, too many clients already
> 
> We have increased the max_connections = 1000 in postgresql.conf file.
> 
> It worked ok for a day and later we get the same error message.
> 
> Please help to advise on any additional settings required. The prior
> Postgresql 9.4 had the default max_connections = 100 and the applications
> worked fine.

It sounds like at least one thing is still running, perhaps running very
slowly.

You should monitor the number of connections to figure out what.

If you expect to be able to run with only 100 connections, then when
connections>200, there's already over 100 connections which shouldn't still be
there.

You could query pg_stat_activity to determine what they're doing - trying to
run a slow query ?  Are all/most of them stuck doing the same thing ?
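For example, a sketch like this groups them by what they're doing/waiting on:

SELECT state, wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
 GROUP BY 1, 2, 3 ORDER BY 4 DESC;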

You should try to provide the information here for the slow query, and for the
rest of your environment.

https://wiki.postgresql.org/wiki/Slow_Query_Questions

-- 
Justin




Re: DB connection issue suggestions

2022-05-11 Thread Justin Pryzby
On Wed, May 11, 2022 at 09:52:10AM +0800, Sudhir Guna wrote:
> Hi Justin,
> 
> Thank you for reviewing.
> 
> I have tried to run the below query and could see only less than 5
> connections active when I get this error. The total rows I see is only 10
> including idle and active sessions for this output.

That doesn't sound right.  Are you sure you're connecting to the correct
instance ?  Are there really only 5 postgres processes on the server, and fewer
than 5 connections to its network port or socket ?

You didn't provide any other info like what OS this is.

-- 
Justin




Re: DB connection issue suggestions

2022-05-12 Thread Justin Pryzby
If the problem occurs gradually (like leaking 20 connections per hour during
ETL), you can check pg_stat_activity every hour or so to try to observe the
problem before all the connection slots are used up, to collect diagnostic
information.

Alternately, leave a connection opened to the DB and wait until all connection
slots *are* used up, and then check pg_stat_activity.  That will take longer,
and you'll have more information to weed through.

What messages are in the server's log ?

v11.2 is years old and hundreds of bugfixes behind.  Since you ran into this
problem anyway, why not run 11.16, which was released today ?

How did you install postgres 11 ?  From source or from packages ?  Which
packages ?  The semi-official PGDG RPM packages are available here:
https://yum.postgresql.org/

-- 
Justin




Re: How to monitor Postgres real memory usage

2022-05-24 Thread Justin Pryzby
On Wed, May 25, 2022 at 12:25:28AM +0800, 徐志宇徐 wrote:
> Hi All
> 
>   I am a Database DBA. I focus on PostgreSQL and DB2.
>   Recently. I experience some memory issue. The postgres unable allocate
> memory. I don't know how to monitor Postgres memory usage.

Postgres is just an OS process, so it should be monitored like any other.

What OS are you using ?

Be aware that the OS may attribute "shared buffers" memory to whichever
processes have touched it, so the same memory can be counted against multiple
processes.

>  This server have 16G memory. On that time. The free command display only 3
> G memory used. The share_buffers almost 6G.
> 
>  On that time. The server have 100 active applications.
>  New connection failed. I have to kill some application by os command "kill 
> -9"

It's almost always a bad idea to kill postgres with kill -9.

> The checkpoint command execute very slow. almost need 5-10 seconds.

Do you mean an interactive checkpoint command ?
Or logs from log_checkpoints ?

>   Is there any useful command to summary PostgreSQL memory usage ?

You can check memory use of an individual query with "explain (analyze,buffers) 
.."
https://wiki.postgresql.org/wiki/Slow_Query_Questions

What settings have you used in postgres ?
https://wiki.postgresql.org/wiki/Server_Configuration

What postgres version ?
How was it installed ?  From souce?  From a package ?

-- 
Justin




Re: How to monitor Postgres real memory usage

2022-05-26 Thread Justin Pryzby
> enable_seqscan = 'off'

Why is this here ?  I think when people set this, it's because they "want to
use more index scans to make things faster".  But index scans aren't
necessarily faster, and this tries to force their use even when it will be
slower.  It's better to address the queries that are slow (or encourage index
scans by decreasing random_page_cost).

> maintenance_work_mem = '64MB'
> autovacuum_max_workers = '20'
> vacuum_cost_limit = '2000'
> autovacuum_vacuum_scale_factor = '0.0002'
> autovacuum_analyze_scale_factor = '0.1'

This means you're going to use up to 20 processes simultaneously running vacuum
(each of which may use 64MB memory).  What kind of storage does the server
have?  Can it support 20 background processes reading from disk, in addition to
other processes ?

Justin Pryzby wrote on Wed, May 25, 2022 at 01:40:
> > What postgres version ?
> > How was it installed ?  From souce?  From a package ?

What about this ?

I'm not sure how/if this would affect memory allocation, but if the server is
slow, processes will be waiting longer, rather than completing quickly, and
using their RAM for a longer period...

Does the postgres user have any rlimits set ?

Check:
ps -fu postgres
# then:
sudo cat /proc/2948/limits




Re: How to monitor Postgres real memory usage

2022-05-26 Thread Justin Pryzby
On Fri, May 27, 2022 at 01:39:15AM +0800, 徐志宇徐 wrote:
> Hi Justin
> 
>   Thanks for you explaination.
> 
>   > > What postgres version ?
>   > > How was it installed ?  From souce?  From a package ?
>   I am using Postgres 11.1 .It's installed by package.

This is quite old, and missing ~4 years of bugfixes.

What's the output of these commands?
tail /proc/sys/vm/overcommit_*
tail /proc/sys/vm/nr_*hugepages /proc/cmdline
cat /proc/meminfo
uname -a

-- 
Justin




Re: How to monitor Postgres real memory usage

2022-05-27 Thread Justin Pryzby
On Sat, May 28, 2022 at 01:40:14AM +0800, 徐志宇徐 wrote:
> vm.swappiness=0

I think this is related to the problem.

swappiness=0 means to *never* use swap, even if that means that processes are
killed.

If you really wanted that, you should remove the swap space.

Swap is extremely slow and worth avoiding, but this doesn't let you use it at
all.  You can't even look at your swap usage as a diagnostic measure to tell if
things had been paged out at some point.

I *suspect* the problem will go away if you set swappiness=1 in /proc (and in
sysctl.conf).

-- 
Justin




Re: postgres backend process hang on " D " state

2022-05-29 Thread Justin Pryzby
On Sun, May 29, 2022 at 01:20:12PM +, James Pang (chaolpan) wrote:
>We have a performance test on Postgresql 13.4 on RHEL8.4 , just after 
> connection storm in ( 952 new connections coming in 1 minute),  a lot of 
> backends start on " D " state,  and when more sessions got disconnected, they 
> do not exit successfully, instead became  "defunct".   No errors from 
> postgresql.log , just after the connection storm, some pg_cron workers can 
> not started either.  The server is a Virtual machine and no IO hang 
> (actually) IO load is very low.   Could be a postgresql bug or an OS abnormal 
> behavior?

What settings have you set ?
https://wiki.postgresql.org/wiki/Server_Configuration

What extensions do you have loaded?  \dx

Send the output of SELECT * FROM pg_stat_activity either as an attachment or in
\x mode?

What is your data dir ?  Is it on the VM's root filesystem or something else ?
Show the output of "mount".  Are there any kernel messages in /var/log/messages
or `dmesg` ?

How many relations are in your schema ?
Are you using temp tables ?
Long-running transactions ?

-- 
Justin




Re: postgres backend process hang on " D " state

2022-05-29 Thread Justin Pryzby
On Mon, May 30, 2022 at 01:19:56AM +, James Pang (chaolpan) wrote:
> 1. extensions 
>   shared_preload_libraries = 
> 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user'
> 2.  psql can not login now ,it hang there too, so can not check anything from 
> pg_stats_* views
> 3.  one main app user and 2 schemas ,no long running transactions . 
> 4. we use /pgdata , it's on xfs , lvm/vg RHEL8.4 ,it's a shared storage, no 
> use root filesystem.
> /dev/mapper/pgdatavg-pgdatalv 500G  230G  271G  46% /pgdata
> /dev/mapper/pgdatavg-pgarchivelv  190G  1.5G  189G   1% /pgarchive
> /dev/mapper/pgdatavg-pgwallv  100G   34G   67G  34% /pgwal

What are the LVM PVs ?  Is it a scsi/virt device ?  Or iscsi/drbd/???

I didn't hear back if there's any kernel errors.
Is the storage broken/stuck/disconnected ?
Can you run "time find /pgdata /pgarchive /pgwal -ls |wc" ?

Could you run "ps -u postgres -O wchan="

Can you strace one of the stuck backends ?

It sounds like you'll have to restart the service or VM (forcibly if necessary)
to resolve the immediate issue and then collect the other info, and leave a
"psql" open to try to (if the problem recurs) check pg_stat_activity and other
DB info.




Re: Query is taking too long i intermittent

2022-06-06 Thread Justin Pryzby
On Mon, Jun 06, 2022 at 03:28:43PM +0530, Mayank Kandari wrote:
> 

Thanks for including the link*.

(*FYI, I find it to be kind of unfriendly to ask the same question in multiple
forums, simultaneously - it's like cross-posting.  The goal seems to be to
demand an answer from the internet community as quickly as possible.)

> Indexes:
> "event_pkey" PRIMARY KEY, btree (event_id, pkey)
> "event_event_sec_event_usec_idx" btree (event_sec, event_usec)
> When I execute the following query it takes 1 - 2 milliseconds to execute.

> I am using libpq to connect the Postgres server in c++ code. Postgres
> server version is 12.10
> Time is provided as a parameter to function executing this query, it
> contains epoche seconds and microseconds.

Are you using the simple query protocol or the extended protocol ?

> This query is executed every 30 seconds on the same client connection
> (Which is persistent for weeks). This process runs for weeks, but some time
> same query starts taking more than 10 minutes. Once it takes 10 minutes,
> after that every execution takes > 10 minutes.

> If I restart the process it recreated connection with the server and now
> execution time again falls back to 1-2 milliseconds. This issue is
> intermittent, sometimes it triggers after a week of the running process and
> sometime after 2 - 3 weeks of the running process.

Could you get the query plan for the good vs bad executions ?

To get the "bad" plan, I suggest to enable auto-explain and set its min
duration to 10 seconds or 1 minute.  The "good" plan you can get any time from
psql.
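A quick per-session sketch looks like this (for the long-lived application
connection you'd instead add auto_explain to session_preload_libraries and
reload):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1min';
SET auto_explain.log_analyze = on;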

> SELECT event_id FROM event WHERE (event_sec > time.seconds) OR
> ((event_sec=time.seconds) AND (event_usec>=time.useconds) ORDER BY
> event_sec, event_usec LIMIT 1

I think it'd be better if the column was a float storing the fractional number
of seconds.  Currently, it may be hard for the planner to estimate rowcounts if
the conditions are not independent.  I don't know if it's related to this
problem, though.




Re: Strange behavior of limit clause in complex query

2022-06-08 Thread Justin Pryzby
On Wed, Jun 08, 2022 at 09:44:08AM +0100, Paulo Silva wrote:
> But if I add an ORDER BY and a LIMIT something goes very wrong (Q2):

A somewhat common problem.

A common workaround is to change "ORDER BY a" to something like "ORDER BY a+0"
(if your framework will allow it).
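For example, if the generated SQL ends in something like this (a guess - I
don't have the query text):

ORDER BY "Extent1".date_from DESC LIMIT 10

then the workaround would be:

ORDER BY "Extent1".date_from + interval '0' DESC LIMIT 10

which returns the same ordering, but keeps the planner from walking
ix_ng_content_date backwards just to satisfy the LIMIT.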

> An EXPLAIN (ANALYZE, BUFFERS) for Q2 returns this:
...
>->  Index Scan Backward using ix_ng_content_date on ng_content 
> "Extent1"  (cost=0.43..40616715.85 rows=2231839 width=12) (actual 
> time=11027.808..183839.289 rows=5 loops=1)
>  Filter: ((2 = id_status) AND (date_from <= 
> LOCALTIMESTAMP) AND (date_to >= LOCALTIMESTAMP) AND (SubPlan 1))
>  Rows Removed by Filter: 4685618
>  Buffers: shared hit=15414533 read=564480 written=504

I'm not sure if it would help your original issue, but the rowcount estimate
here is bad - overestimating 2231839 rows instead of 5.

Could you try to determine which of those conditions (id_status, date_from,
date_to, or SubPlan) causes the mis-estimate, or if the estimate is only wrong
when they're combined ?

-- 
Justin




Re: Fluctuating performance of updates on small table with trigger

2022-06-29 Thread Justin Pryzby
On Wed, Jun 29, 2022 at 09:31:58PM +0200, Mikkel Lauritsen wrote:
> In short I'm running PostgreSQL 14.1 on Linux on a small test machine with

should try to upgrade to 14.4, for $reasons

> Is there any feasible way to find out what it is that causes Postgres to
> start doing slow updates? My guess would be a buffer filling up or something
> similar, but the regularity between runs paired with the irregular lengths
> of the fast and slow phases in each run doesn't really seem to fit with
> this.

Set log_checkpoints=on, log_autovacuum_min_duration=0, log_lock_waits=on, and
enable autoexplain with auto_explain.log_nested_statements=on.

Then see what's in the logs when that happens.

@hackers: the first two of those are enabled by default in 15dev, and this
inquiry seems to support that change.

-- 
Justin




Re: partition pruning only works for select but update

2022-07-01 Thread Justin Pryzby
On Fri, Jul 01, 2022 at 08:30:40AM +, James Pang (chaolpan) wrote:
> We have other application depend on V13, possible to backport code changes to 
> V13 as 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=86dc90056dfdbd9d1b891718d2e5614e3e432f35

Do you mean that the other application needs to be updated to work with v14?
Or that you haven't checked yet if they work with v14?

In any case, I'm sure the feature won't be backpatched to v13 - it's an
improvement but not a bugfix.

-- 
Justin

> -Original Message-
> From: Tom Lane  
> Sent: Tuesday, June 28, 2022 9:30 PM
> To: James Pang (chaolpan) 
> Cc: pgsql-performance@lists.postgresql.org
> Subject: Re: partition pruning only works for select but update
> 
> "James Pang (chaolpan)"  writes:
> > But when
> > Explain update table set .. where  partitionkey between  to_timestamp() and 
> > to_timestamp();
> >   It still show all of partitions with update ...
> 
> In releases before v14, partition pruning is far stupider for UPDATE (and 
> DELETE) than it is for SELECT.




Re: Occasional performance issue after changing table partitions

2022-07-10 Thread Justin Pryzby
On Sun, Jul 10, 2022 at 04:55:34PM +1200, Nathan Ward wrote:
> I am running Postgres 13 on CentOS 7, installed from the yum.postgresql.org 
>  repo.

It doesn't sound relevant, but what kind of storage system is postgres using ?
Filesystem, RAID, device.

Is the high CPU use related to autovacuum/autoanalyze ?

> The issue I am having, is that when the daily data usage aggregation runs, 
> sometimes we have a big performance impact, with the following 
> characteristics which happen *after* the aggregation job runs in it usual 
> fast time of 12s or so:
> - The aggregation runs fast as per normal
> - Load on the server goes to 30-40 - recall we have quite high “max 
> connections” to keep throughput high when the client is far (16ms) from the 
> server

I suggest to install and enable autoexplain to see what's running slowly here,
and what its query plans are.  It seems possible that when the daily
aggregation script drops the old partitions, the plan changes for the worse.
I'm not sure what the fix is - maybe you just need to run vacuum or analyze on
the new partitions soon after populating them.

For good measure, also set log_autovacuum_min_duration=0 (or something other
than -1) (and while you're at it, log_checkpoints=on, and log_lock_waits=on if
you haven't already).

Note that postgres doesn't automatically analyze parent tables, so you should
maybe do that whenever the data changes enough for it to matter.
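For example (a sketch; the table names are placeholders):

ANALYZE the_new_partition;  -- soon after it's populated/attached
ANALYZE the_parent_table;   -- parent-level stats aren't maintained by autovacuum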

-- 
Justin




Re: Occasional performance issue after changing table partitions

2022-07-10 Thread Justin Pryzby
On Mon, Jul 11, 2022 at 03:21:38PM +1200, Nathan Ward wrote:
> > Note that postgres doesn't automatically analyze parent tables, so you 
> > should
> > maybe do that whenever the data changes enough for it to matter.
> 
> Hmm. This raises some stuff I’m not familiar with - does analysing a parent 
> table do anything?

Yes

You could check if you have stats now (maybe due to a global ANALYZE or
analyzedb) and how the query plans change if you analyze.
The transaction may be overly conservative.

SELECT COUNT(1) FROM pg_stats WHERE tablename=PARENT;
SELECT last_analyze, last_autoanalyze, relname FROM pg_stat_all_tables WHERE 
relname=PARENT;
begin;
SET default_statistics_target=10;
ANALYZE;
explain SELECT [...];
rollback;

> I got the impression that analysing the parent was just shorthand for 
> analysing all of the attached partitions.

Could you let us know if the documentation left that impression ?

See here (this was updated recently).

https://www.postgresql.org/docs/13/sql-analyze.html#id-1.9.3.46.8

For partitioned tables, ANALYZE gathers statistics by sampling rows from all 
partitions; in addition, it will recurse into each partition and update its 
statistics. Each leaf partition is analyzed only once, even with multi-level 
partitioning. No statistics are collected for only the parent table (without 
data from its partitions), because with partitioning it's guaranteed to be 
empty.

By contrast, if the table being analyzed has inheritance children, ANALYZE 
gathers two sets of statistics: one on the rows of the parent table only, and a 
second including rows of both the parent table and all of its children. This 
second set of statistics is needed when planning queries that process the 
inheritance tree as a whole. The child tables themselves are not individually 
analyzed in this case.

The autovacuum daemon does not process partitioned tables, nor does it process 
inheritance parents if only the children are ever modified. It is usually 
necessary to periodically run a manual ANALYZE to keep the statistics of the 
table hierarchy up to date.

> Perhaps because I attach a table with data, the parent sometimes decides it 
> needs to run analyse on a bunch of things?

No, that doesn't happen.

> Or, maybe it uses the most recently attached partition, with bad statistics, 
> to plan queries that only touch other partitions?

This is closer to what I was talking about.

To be clear, you are using relkind=p partitions (added in v10), and not
inheritance parents, right ?

-- 
Justin




Re: Oracle_FDW table performance issue

2022-07-11 Thread Justin Pryzby
On Mon, Jul 11, 2022 at 05:38:34PM +0530, aditya desai wrote:
> Hi,
> I have one Oracle fdw table which is giving performance issue when joined
> local temp table gives performance issue.
> 
> select * from oracle_fdw_table where transaction_id in ( select
> transaction_id from temp_table)  54 seconds. Seeing HASH SEMI JOIN  in
> EXPLAIN PLAN. temp_table has only 74 records.

You'd have to share the plan

https://wiki.postgresql.org/wiki/Slow_Query_Questions

Do the tables have updated stats ?




Re: Occasional performance issue after changing table partitions

2022-07-12 Thread Justin Pryzby
On Wed, Jul 13, 2022 at 03:13:46AM +1200, Nathan Ward wrote:
> I have been stepping through the various statements which are different 
> between the two functions, and note that when I do math on a timestamp in a 
> SELECT statement (i.e. _event_timestamp - INTERVAL ‘1 hour’),
> the planner takes 50ms or so - note that the result of the timestamp is used 
> to search the partition key.
> If I declare a function which does the math in advance, stores it in a 
> variable and then runs the SELECT, the planner takes less than 1ms.
> Does this mean it’s calculating the timestamp for each partition, or 
> something like that?

I'm not sure I understand what you're doing - the relevant parts of your
function text and query plan would help here.

Maybe auto_explain.log_nested_statements would be useful ?

Note that "partition pruning" can happen even if you don't have a literal
constant.  For example:
|explain(costs off) SELECT * FROM metrics WHERE start_time > now()::timestamp - 
'1 days'::interval;
| Append
|   Subplans Removed: 36

> I see Postgres 14 release notes has information about performance 
> improvements in the planner for updates on tables with "many partitions”. Is 
> 444 partitions “many”?
> My updates are all impacting a single partition only.

It sounds like that'll certainly help you.  Another option is to update the
partition directly (which is what we do, to be able to use "ON CONFLICT").

I think with "old partitioning with inheritance", more than a few hundred
partitions was considered unreasonable, and plan-time suffered.

With relkind=p native/declarative partitioning, a few hundred is considered
reasonable, and a few thousand is still considered excessive - even if the
planner time is no issue, you'll still run into problems like "work-mem is
per-node", which works poorly when you might have 10x more nodes.

TBH, this doesn't sound related to your original issue.

-- 
Justin




Re: PostgresSQL 9.5.21 very slow to connect and perform basic queries

2022-07-21 Thread Justin Pryzby
On Thu, Jul 21, 2022 at 02:37:35PM -0400, bruno da silva wrote:
> I'm investigating an issue on a PostgresSql 9.5.21 installation that
> becomes unusable in an intermittent way. Simple queries like "select
> now();" could take 20s. commits take 2s. and all gets fixed after an engine
> restart.
> 
> I look into the pg logs and no signs of errors. and checkpoints are
> always timed. The machine is well provisioned, load isn't too high, and cpu
> io wait is under 1%.
> 
> any suggestions on what I should check more?

What OS/version is it ?

What GUCs have you changed ?

Is it a new issue ?

https://wiki.postgresql.org/wiki/Slow_Query_Questions

Operating system+version
What OS / version ? At least for linux, you can get the distribution by 
running: tail /etc/*release 

GUC Settings
What database configuration settings have you changed? What are their values? 
(These are things like "shared_buffers", "work_mem", "enable_seq_scan", 
"effective_io_concurrency", "effective_cache_size", etc). See Server 
Configuration for a useful query that will show all of your non-default 
database settings, in an easier to read format than posting pieces of your 
postgresql.conf file. 

-- 
Justin




Re: PostgresSQL 9.5.21 very slow to connect and perform basic queries

2022-07-21 Thread Justin Pryzby
On Thu, Jul 21, 2022 at 03:59:30PM -0400, bruno da silva wrote:
>   OS/version: CentOS release 6.9 (Final)

How are these set ?

tail /sys/kernel/mm/ksm/run 
/sys/kernel/mm/transparent_hugepage/{defrag,enabled,khugepaged/defrag} 
/proc/sys/vm/zone_reclaim_mode

I suspect you may be suffering from issues with transparent huge pages.

I suggest disabling KSM and THP, or upgrading to a newer OS.

I've written before about these:
https://www.postgresql.org/message-id/20170524155855.gh31...@telsasoft.com
https://www.postgresql.org/message-id/20190625162338.gf18...@telsasoft.com
https://www.postgresql.org/message-id/20170718180152.ge17...@telsasoft.com
https://www.postgresql.org/message-id/20191004060300.ga11...@telsasoft.com
https://www.postgresql.org/message-id/20200413144254.gs2...@telsasoft.com
https://www.postgresql.org/message-id/20220329182453.ga28...@telsasoft.com

On Thu, Jul 21, 2022 at 04:01:10PM -0400, bruno da silva wrote:
> The issue started a month ago.

Ok .. but how long has the DB been running under this environment ?

-- 
Justin




Re: Postgresql 14 partitioning advice

2022-07-27 Thread Justin Pryzby
On Wed, Jul 27, 2022 at 08:55:14AM -0400, Rick Otten wrote:
> I'm spinning up a new Postgresql 14 database where I'll have to store a
> couple years worth of time series data at the rate of single-digit millions
> of rows per day.  Since this has to run in AWS Aurora, I can't use
> TimescaleDB.

> One person I talked to said "try not to have more than 100 partitions",
> even with the latest postgresql you'll end up with a lot of lock contention
> if you go over 100 partitions.

I'm not familiar with this (but now I'm curious).  We have over 2000 partitions
in some tables.  No locking issue that I'm aware of.  One issue that I *have*
seen is if you have many partitions, you can end up with query plans with a
very large number of planner nodes, and it's hard to set
work_mem*hash_mem_multiplier to account for that.

> This person also recommended manually
> kicking off vacuums on a regular schedule rather than trusting autovacuum
> to work reliably on the partitioned tables.

They must mean *analyze*, which does not run automatically on the partitioned
tables (only on the partitions).  The partitioned table itself is empty, so it
doesn't need to be vacuumed.
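
For example, a cron'able sketch (assuming a partitioned table named "metrics"):

| -- ANALYZE on the parent also recurses into the partitions; the parent's
| -- own stats are the part autovacuum won't maintain:
| ANALYZE metrics;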

> I've got several keys, besides the obvious time-key that I could partition
> on.   I could do a multi-key partitioning scheme.  Since the data is
> inbound at a relatively steady rate, if I partition on time, I can adjust
> the partitions to be reasonably similarly sized.  What is a good partition
> size?

Depends on 1) the target number of partitions; and 2) the target size for
indexes on those partitions.  More partitions will lead to smaller indexes.
Depending on the type of index and the index keys, to get good INSERT
performance you may need to set shared_buffers to accommodate the sum of the
sizes of all the indexes (but maybe not, if the leading column is the timestamp).

> Since the data most frequently queried would be recent data (say the past
> month or so) would it make sense to build an archiving strategy that rolled
> up older partitions into larger ones?  ie, do daily partitions for the
> first four weeks, then come up with a process that rolled them up into
> monthly partitions for the next few months, then maybe quarterly partitions
> for the data older than a year?  (I'm thinking about ways to keep the
> partition count low - if that advice is justified.)

I think it can make sense.  I do that myself in order to: 1) avoid having a
huge *total* number of tables (which causes pg_attribute to be large, since our
tables are also "wide"); and 2) make our backups of "recent data" smaller; and
3) make autoanalyze a bit more efficient (a monthly partition will be analyzed
numerous times during the 2nd half of the month, even though all the historic data
hasn't changed at all).

> Or, should I just have a single 7 Trillion row table with a BRIN index on
> the timestamp and not mess with partitions at all?

Are you going to need to DELETE data ?  Then this isn't great, and DELETEing
data will inevitably cause a lower correlation, making BRIN less effective.

-- 
Justin




Re: pg_wal filling up while running huge updates

2022-08-05 Thread Justin Pryzby
On Fri, Aug 05, 2022 at 06:00:02PM +0530, aditya desai wrote:
> Hi,
> We are doing an oracle to postgres migration(5 TB+ data). We are encoding
> and decoding BLOB data after migration and for that we are running updates
> on tables having BLOB/CLOB data. When we execute this pg_wal is filling up.

Could you please include basic information in each new thread you create ?

https://wiki.postgresql.org/wiki/Server_Configuration
https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

-- 
Justin




Re: Postgresql 14 partitioning advice

2022-08-08 Thread Justin Pryzby
On Mon, Aug 08, 2022 at 03:45:11PM -0700, Slava Mudry wrote:
> Postgres 14 improved partitioning quite a bit. I used it in Postgres 9 and
> there was a lot of locking on partition hierarchy when you add/drop
> partition tables.

Note that postgres 9 didn't have native/declarative partitioning, and most
improvements in native partitioning don't apply to legacy/inheritance
partitioning.

https://www.postgresql.org/docs/devel/ddl-partitioning.html

"Native" partitioning added in v10 tends to require stronger locks for add/drop
than legacy partitioning, since partitions have associated bounds, which cannot
overlap.  The locking is improved in v12 with CREATE+ATTACH and v14 with
DETACH CONCURRENTLY+DROP.
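
A rough sketch of both patterns (table names and bound values invented):

| -- v12+: creating the table first and then attaching it takes a weaker lock
| -- on the parent than CREATE TABLE .. PARTITION OF does:
| CREATE TABLE t_2022_09 (LIKE t INCLUDING ALL);
| ALTER TABLE t ATTACH PARTITION t_2022_09
|     FOR VALUES FROM ('2022-09-01') TO ('2022-10-01');
|
| -- v14+: detach without blocking concurrent queries on the parent, then drop:
| ALTER TABLE t DETACH PARTITION t_2022_08 CONCURRENTLY;
| DROP TABLE t_2022_08;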

-- 
Justin




Re: to_jsonb performance on array aggregated correlated subqueries

2022-08-12 Thread Justin Pryzby
What version of postgres ?

I wonder if you're hitting the known memory leak involving jit.
Try with jit=off or jit_inline_above_cost=-1.
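
Something like this, at the session level or in postgresql.conf:

| SET jit = off;
| -- or, to keep jit but avoid the inlining path:
| SET jit_inline_above_cost = -1;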

-- 
Justin




Re: to_jsonb performance on array aggregated correlated subqueries

2022-08-12 Thread Justin Pryzby
On Fri, Aug 12, 2022 at 07:02:36PM +, Nico Heller wrote:
> I knew I forgot something: We are currently on 13.6. When was this issue
> fixed?

There's a WIP/proposed fix, but the fix is not released.
I asked about your version because jit was disabled by default in v11.
But it's enabled by default in v12.

https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items#Older_bugs_affecting_stable_branches

-- 
Justin




Re: Postgresql JDBC process consumes more memory than psql client

2022-09-05 Thread Justin Pryzby
On Mon, Sep 05, 2022 at 12:40:46PM +, James Pang (chaolpan) wrote:
>   We run same update or delete SQL statement " DELETE FROM ... WHERE ... "  
> the table is a hash partition table (256 hash partitions). When run the sql 
> from Postgresql JDBC driver, it soon increased to 150MB memory (RES field 
> from top command),   but when run the same SQL from psql , it only 
> consumes about 10MB memory.  UPDATE statements is similar , need 100MB 
> memory, even it delete or update 0 rows.  Any specific control about 
> Postgresql JDBC driver ?

It sounds like JDBC is using prepared statements, and partitions maybe
weren't pruned by the server.  What is the query plan from psql vs from
jdbc ?

https://wiki.postgresql.org/wiki/Slow_Query_Questions

What version is the postgres server ?
That affects pruning as well as memory use.

https://www.postgresql.org/docs/14/release-14.html
Improve the performance of updates and deletes on partitioned tables
with many partitions (Amit Langote, Tom Lane)

This change greatly reduces the planner's overhead for such cases, and
also allows updates/deletes on partitioned tables to use execution-time
partition pruning.

Actually, this is about the same response as when you asked in June,
except that was about UPDATE.
https://www.postgresql.org/message-id/ph0pr11mb519134d4171a126776e3e063d6...@ph0pr11mb5191.namprd11.prod.outlook.com

-- 
Justin




Re: Postgresql JDBC process consumes more memory than psql client

2022-09-05 Thread Justin Pryzby
On Mon, Sep 05, 2022 at 12:52:14PM +, James Pang (chaolpan) wrote:
> Any idea how to print SQL plan from JDBC driver ? 

You could use "explain execute" on the client, or autoexplain on the
server-side.
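
A sketch of both (generic statement; substitute your real query):

| -- client side: prepare the same statement the driver would, then explain it:
| PREPARE q (int) AS DELETE FROM t WHERE id = $1;
| EXPLAIN (ANALYZE, BUFFERS) EXECUTE q(42);
|
| -- server side (as superuser, or via session_preload_libraries):
| LOAD 'auto_explain';
| SET auto_explain.log_min_duration = 0;
| SET auto_explain.log_analyze = on;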

-- 
Justin




Re: Postgresql JDBC process consumes more memory than psql client

2022-09-06 Thread Justin Pryzby
On Tue, Sep 06, 2022 at 04:15:03AM +, James Pang (chaolpan) wrote:
> We make 2 comparisons between partitioned(256 HASH) and no-partitioned(same 
> data volume,same table attributes) , do same  "UPDATE,DELETE " .
>  1. with partitioned tables , the "RES" from top command memory increased 
> quickly to 160MB and keep stable there. 
>   From auto_explain trace, we did saw  partition pruning to specific 
> partition when execution the prepared sql statement by Postgresql JDBC .
> 2. with no-partitioned tables, the "RES" from top command memory only keep 
> 24MB stable there. 
>Same auto_explain , and only table and index scan there by prepared 
> sql statement by Postgresql JDBC. 
> 3. with psql client , run the UPDATE/DELETE sql locally,  partition pruning 
> works and the "RES" memory" is much less, it's about 9MB . 
> 
> Yesterday, when workload test, a lot of Postgresql JDBC connections use 
> 150-160MB memory , so we got ERROR: out of memory

How many JDBC clients were there?

Did you use the same number of clients when you used psql ?
Otherwise it wasn't a fair test.

Also, did you try using psql with PREPARE+EXECUTE ?  I imagine memory
use would match JDBC.

It's probably not important, but if you set the log level high enough,
you could log memory use more accurately using log_executor_stats
(maxrss).
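
A minimal sketch of that (log_executor_stats needs superuser):

| SET client_min_messages = log;  -- show the LOG output in this session
| SET log_executor_stats = on;
| SELECT count(*) FROM pg_class;  -- the "EXECUTOR STATISTICS" output includes max resident size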

> So, looks like something with Postgresql JDBC driver lead to the high memory 
> consumption when table is partitioned , even when table is no partitioned , 
> compared with psql client, it consumes more memory.   Any suggestions to tune 
> that ?  PG V13 , OS RHEL8 , Virtua machine on VMWARE. We make 
> shared_buffers=36% physical memory ,  effective_cache_size=70%physical memory 
> , total physical memory is about 128GB.

I sent this before hoping to get answers to all the most common
questions earlier, rather than being spread out over the first handful
of emails.

https://wiki.postgresql.org/wiki/Slow_Query_Questions

version 13 point what ?
what are the other non-default gucs ?
what are the query plans ?

-- 
Justin




Re: Query is sometimes fast and sometimes slow: what could be the reason?

2022-09-14 Thread Justin Pryzby
On Wed, Sep 14, 2022 at 05:02:07PM +0200, tias...@gmx.de wrote:
> What could be the reason of a query, which is sometimes fast 
> and sometimes slow (factor >10x)?
> (running on a large table).
>  
>  

Lots of possible issues.  Is it using a different query plan ?
Collect a good plan and a bad one and compare, or send both.
Perhaps use autoexplain to do so.

Turn on logging and send as much information as you can as described
here.
https://wiki.postgresql.org/wiki/Slow_Query_Questions

Please try to configure your mail client to send text mail (instead of
or in addition to the html one).

-- 
Justin




Re: Catching up with performance & PostgreSQL 15

2022-11-28 Thread Justin Pryzby
On Mon, Nov 28, 2022 at 06:59:41PM -0800, Josh Berkus wrote:
> Hey, folks:
> 
> I haven't configured a PostgreSQL server since version 11 (before that, I
> did quite a few).
> 
> What's changed in terms of performance configuration since then?  Have the
> fundamentals of shared_buffers/work_mem/max_connections changed at all?
> Which new settings are must-tunes?
> 
> I've heard about new parallel stuff and JIT, but neither is that applicable
> to my use-case.

shared buffers is the same, but btree indexes are frequently (IME) 3x
smaller (!) since deduplication was added in v13, so s_b might not need
to be as large.

In addition to setting work_mem, you can also (since v13) set
hash_mem_multiplier.

default_toast_compression = lz4 # v14
recovery_init_sync_method = syncfs # v14
check_client_connection_interval = ... # v14
wal_compression = {lz4,zstd} # v15

Peeking at my notes, there's also: partitioning, parallel query, brin
indexes, extended statistics, reindex concurrently, ...

... but I don't think anything is radically changed :)

-- 
Justin




Re: Odd Choice of seq scan

2022-12-01 Thread Justin Pryzby
On Fri, Dec 02, 2022 at 11:52:19AM +1100, Paul McGarry wrote:
> Hi there,
> 
> I'm wondering if anyone has any insight into what might make the database
> choose a sequential scan for a query (table defs and plan below) like :

> Plan - seq scan of table:
> =
> > explain select orders.orderid FROM orders WHERE (orders.orderid IN 
> > ('546111') OR orders.orderid IN  (select orderid FROM orderstotrans WHERE 
> > (transid IN ('546111';

> Plan - Seq scan and filter of index:
> =
> > explain select orders.orderid FROM orders WHERE (orders.orderid IN 
> > ('546111') OR orders.orderid IN  (select orderid FROM orderstotrans WHERE 
> > (transid IN ('546111';

Could you show explain analyze ?

Show the size of the table and its indexes 
And GUC settings
And the "statistics" here:
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram
Maybe on both a well-behaving instance and a badly-behaving instance.

-- 
Justin




Re: Postgres12 looking for possible HashAggregate issue workarounds?

2022-12-16 Thread Justin Pryzby
On Fri, Dec 16, 2022 at 03:24:17PM +, João Paulo Luís wrote:
> Hi! Sorry to post to this mailing list, but I could not find many tips 
> working around HashAggregate issues.
> 
> In a research project involving text repetition analysis (on top of public 
> documents)
> I have a VirtualMachine (CPU AMD Epyc 7502P, 128GB RAM, 12TB HDD, 2TB SSD),
> running postgres 12.12 (Ubuntu 12.12-0ubuntu0.20.04.1)
> and some tables with many rows:

> 1 - the query is making a postgresql process have 76.7 GB resident RAM usage.
> Having a WORK_MEM setting of 2GB (and "simple" COUNT() results),
> that was not expected.
> (I risk oom-killer killing my postgres as soon as I run another concurrent
> query.)

> The rows=261275 on HashAggregate  (cost=26397219.92..26399832.67 rows=261275 
> width=8) seems VERY WRONG!
> I was expecting something like rows=1.0E+09 instead.

> I would guess that HashAggregate is behaving very badly (using too much RAM 
> beyond WORK_MEM, and also badly estimating the #rows and taking forever...)

Huge memory use sounds like what was fixed in postgres 13.

https://www.postgresql.org/docs/13/release-13.html

Allow hash aggregation to use disk storage for large aggregation result
sets (Jeff Davis)

Previously, hash aggregation was avoided if it was expected to use more
than work_mem memory. Now, a hash aggregation plan can be chosen despite
that. The hash table will be spilled to disk if it exceeds work_mem
times hash_mem_multiplier.

This behavior is normally preferable to the old behavior, in which once
hash aggregation had been chosen, the hash table would be kept in memory
no matter how large it got — which could be very large if the planner
had misestimated. If necessary, behavior similar to that can be obtained
by increasing hash_mem_multiplier.
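
Once on v13+, that knob looks like this (the value is only illustrative):

| SET hash_mem_multiplier = 4.0;  -- hash agg may use work_mem * 4 before spilling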

-- 
Justin




Re: temp_file_limit?

2022-12-18 Thread Justin Pryzby
On Sun, Dec 18, 2022 at 12:48:03PM +0100, Frits Jalvingh wrote:
> Hi list,
> 
> I have a misbehaving query which uses all available disk space and then
> terminates with a "cannot write block" error. To prevent other processes
> from running into trouble I've set the following:
> 
> temp_file_limit = 100GB

> The comment in the file states that this is a per-session parameter, so
> what is going wrong here?

Do you mean the comment in postgresql.conf ?

commit d1f822e58 changed to say that temp_file_limit is actually
per-process and not per-session.

Could you send the query plan, preferably "explain analyze" (if the
query finishes sometimes) ?

log_temp_files may be helpful here.
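
For example (superuser):

| ALTER SYSTEM SET log_temp_files = 0;  -- log the name and size of every temp file
| SELECT pg_reload_conf();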

> The query does parallelize and uses one parallel worker while executing,
> but it does not abort when the temp file limit is reached:
> 
> 345G pgsql_tmp
> 
> It does abort way later, after using around 300+ GB:
> [53400] ERROR: temporary file size exceeds temp_file_limit (104857600kB)
> Where: parallel worker

Are you sure the 345G are from only one instance of the query ?
Or is it running multiple times, or along with other queries writing
100GB of tempfiles.

It seems possible that it sometimes runs with more than one parallel
worker.  Also, are there old/stray tempfiles there which need to be
cleaned up?

-- 
Justin




Re: Fwd: temp_file_limit?

2022-12-18 Thread Justin Pryzby
On Sun, Dec 18, 2022 at 06:29:41PM +0100, Frits Jalvingh wrote:
> Just to be clear: my real question is: why is temp_file_limit not
> working at the specified size? Because this is my real problem: when a
> query is dying like this it will also kill other queries because these
> are also running out of space. Even when the limit is per-process it
> should not have exceeded 200GB imo.

What OS and filesystem are in use ?

Could you list the tmpdir when it's getting huge?  The filenames include
the PID, which would indicate if there's another process involved, or a
bug allowed it to get huge.
sudo du --max=2 -mx ./pgsql_tmp |sort -nr

BTW, pg_ls_tmpdir() hides directories, so you shouldn't rely on it for
listing temporary directories...

One possibility is that there are files in the tmpdir, which have been
unlinked, but are still opened, so their space hasn't been reclaimed.
You could check for that by running: lsof -nn |grep pgsql_tmp.  Any deleted
files would say things like 'DEL|deleted|inode|no such'

> BTW, if that limit is really per process instead of per
> session/query then that is a Very Bad Thing(tm), because this makes the
> limit effectively worthless - if a query can spawn 8 parallel processes
> then you can suddenly, without any form of control, again fill up that disk.

8 is the default value of max_worker_processes and max_parallel_workers,
but 2 is the default value of max_parallel_workers_per_gather.  You're
free to change the default value to balance it with the temp_file_limit
(as suggested by the earlier-mentioned commit).

-- 
Justin




Re: Fwd: temp_file_limit?

2022-12-19 Thread Justin Pryzby
On Mon, Dec 19, 2022 at 05:57:42PM +0100, Frits Jalvingh wrote:
> @justin:
> 
> Ran the query again. Top shows the following processes:
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+

Thanks

> root@chatelet:/d2/var/lib/postgresql/15/main/base# du --max=2 -mx
> ./pgsql_tmp |sort -nr
> 412021 ./pgsql_tmp/pgsql_tmp650830.3.fileset
> 412021 ./pgsql_tmp
> ^^^ a few seconds after this last try the query aborted:
> ERROR:  temporary file size exceeds temp_file_limit (104857600kB)
> 
> One possibility is that there are files in the tmpdir, which have been
> > unlinked, but are still opened, so their space hasn't been reclaimed.
> > You could check for that by running lsof -nn |grep pgsql_tmp Any deleted
> > files would say things like 'DEL|deleted|inode|no such'
>
> I do not really understand what you would like me to do, and when. The disk
> space is growing, and it is actual files under pgsql_tmp?

Run this during the query as either postgres or root:
| lsof -nn |grep pgsql_tmp |grep -E 'DEL|deleted|inode|no such'

Any files it lists would be interesting to know about.

> Hope this tells you something, please let me know if you would like more
> info, and again - thanks!

I think Thomas' idea is more likely.  We'd want to know the names of
files being written, either as logged by log_temp_files or from 
| find pgsql_tmp -ls
during the query.

-- 
Justin




Re: Fwd: temp_file_limit?

2022-12-19 Thread Justin Pryzby
On Mon, Dec 19, 2022 at 06:27:57PM +0100, Frits Jalvingh wrote:
> I have listed the files during that run,

> 213M -rw--- 1 postgres postgres 213M dec 19 17:46 i100of128.p0.0
> 207M -rw--- 1 postgres postgres 207M dec 19 17:46 i100of128.p1.0
> 210M -rw--- 1 postgres postgres 210M dec 19 17:49 i100of256.p0.0
> 211M -rw--- 1 postgres postgres 211M dec 19 17:49 i100of256.p1.0
> 188M -rw--- 1 postgres postgres 188M dec 19 17:53 i100of512.p0.0
[...]

I think that proves Thomas' theory.  I'm not sure how that helps you,
though...

On Mon, Dec 19, 2022 at 01:51:33PM +1300, Thomas Munro wrote:
> One possibility is that you've hit a case that needs several rounds of
> repartitioning (because of a failure to estimate the number of tuples
> well), but we can't see that because you didn't show EXPLAIN (ANALYZE)
> output (understandably if it runs out of disk space before
> completing...).  The parallel hash code doesn't free up the previous
> generations' temporary files; it really only needs two generations'
> worth concurrently (the one it's reading from and the one it's writing
> to).  In rare cases where more generations are needed it could unlink
> the older ones -- that hasn't been implemented yet.  If you set
> log_temp_files = 0 to log temporary file names, it should be clear if
> it's going through multiple rounds of repartitioning, from the names
> (...of32..., ...of64..., ...of128..., ...of256..., ...).

-- 
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581




Re: Fwd: temp_file_limit?

2022-12-19 Thread Justin Pryzby
On Sun, Dec 18, 2022 at 06:29:41PM +0100, Frits Jalvingh wrote:
> By itself I'm used to bad query performance in Postgresql; our application
> only does bulk queries and Postgres quite often makes terrible plans for
> those, but with set enable_nestloop=false set always most of them at least
> execute. The remaining failing queries are almost 100% caused by bad join
> sequences; I plan to work around those by forcing the join order from our
> application. For instance, the exact same query above can also generate the
> following plan (this one was created by manually setting
> join_collapse_limit = 1, but fast variants also occur quite often when
> disabling parallelism):

I, too, ended up setting enable_nestloop=false for our report queries,
to avoid the worst-case plans.

But you should also try to address the rowcount misestimates.  This
underestimates the rowcount by a factor of 69 (or 138 in the plan you
sent today):

| (soort = 'MSL'::text) AND (code = 'DAE'::text)

If those conditions are correlated, you can improve the estimate by
adding extended stats object.

| CREATE STATISTICS s_h_sturingslabel_ssm_stats ON soort, code FROM s_h_sturingslabel_ssm;
| ANALYZE s_h_sturingslabel_ssm;

Unfortunately, stats objects currently only improve scans, and not
joins, so that might *improve* some queries, but it won't resolve the
worst problems:

| Hash Join (cost=22,832.23..44,190.21 rows=185 width=47) (actual 
time=159.725..2,645,634.918 rows=28,086,472,886 loops=1) 

Maybe you can improve that by adjusting the stats target or ndistinct...
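
For example (the values are made up; tune them against the real data):

| ALTER TABLE s_h_sturingslabel_ssm ALTER COLUMN soort SET STATISTICS 1000;
| ALTER TABLE s_h_sturingslabel_ssm ALTER COLUMN code SET (n_distinct = 500);
| ANALYZE s_h_sturingslabel_ssm;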

-- 
Justin




Re: Fwd: temp_file_limit?

2022-12-20 Thread Justin Pryzby
On Mon, Dec 19, 2022 at 09:10:27PM +0100, Frits Jalvingh wrote:
> @justin
> 
> I tried the create statistics variant and that definitely improves the
> estimate, and with that one of the "bad" cases (the one with the 82 minute
> plan) now creates a good plan using only a few seconds.
> That is a worthwhile path to follow. A bit hard to do, because those
> conditions can be anything, but I can probably calculate the ones used per
> customer and create those correlation statistics from that... It is
> definitely better than tweaking the "poor man's query hints" enable_
> 8-/ which is really not helping with plan stability either.
> 
> That will be a lot of work, but I'll let you know the results ;)

Yeah, if the conditions are arbitrary, then it's going to be more
difficult.  Hopefully you don't have too many columns.  :)

I suggest enabling autoexplain and monitoring for queries which were
slow, and retroactively adding statistics to those columns which are
most-commonly queried, and which have correlations (which the planner
doesn't otherwise know about).

You won't want to have more than a handful of columns in a stats object
(since it requires factorial(N) complexity), but you can have multiple
stats objects with different combinations of columns (and, in v14,
expressions).  You can also set a lower stats target to make the cost a
bit lower.

You could try to check which columns are correlated, either by running:
| SELECT COUNT(1),col1,col2 FROM tbl GROUP BY 2,3 ORDER BY 1;
for different combinations of columns.

Or by creating a tentative/experimental stats object on a handful of
columns at a time for which you have an intuition about their
correlation, and then checking the calculated dependencies in
pg_stats_ext.  You may need to do something clever to use that for
arbitrary columns.  Maybe this is a start.
| SELECT dep.value::float, tablename, attnames, dep.key, exprs FROM (SELECT 
(json_each_text(dependencies::text::json)).* AS dep, * FROM pg_stats_ext)dep 
WHERE dependencies IS NOT NULL ORDER BY 1 DESC ; -- AND regexp_count(key, ',') 
< 2

-- 
Justin




Re: When you really want to force a certain join type?

2022-12-28 Thread Justin Pryzby
On Wed, Dec 28, 2022 at 10:39:14AM -0500, Gunther Schadow wrote:
> I have a complex query which essentially runs a finite state automaton
> through a with recursive union, adding the next state based on the
> previous.  This is run at 100,000 or a million start states at the same
> time, picking a new record (token), matching it to the FSA (a three-way
> join:

> There are 100s of thousands of states. This join has a HUGE fan out if it is

> I doubt that I can find any trick to give to the planner better data which
> it can then use to figure out that the merge join is a bad proposition.

> Note, for my immediate relief I have forced it by simply set
> enable_mergejoin=off. This works fine, except, it converts both into a
> nested loop, but the upper merge join was not a problem, and sometimes (most
> often) nested loop is a bad choice for bulk data. It's only for this
> recursive query it sometimes makes sense.

Maybe the new parameter in v15 would help.

https://www.postgresql.org/docs/15/runtime-config-query.html#GUC-RECURSIVE-WORKTABLE-FACTOR
recursive_worktable_factor (floating point)

Sets the planner's estimate of the average size of the working table
of a recursive query, as a multiple of the estimated size of the
initial non-recursive term of the query. This helps the planner
choose the most appropriate method for joining the working table to
the query's other tables. The default value is 10.0. A smaller value
such as 1.0 can be helpful when the recursion has low “fan-out” from
one step to the next, as for example in shortest-path queries. Graph
analytics queries may benefit from larger-than-default values.
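
A minimal sketch of experimenting with it on v15 (the value is a guess to try,
not a recommendation):

| SET recursive_worktable_factor = 100.0;  -- default is 10.0
| EXPLAIN WITH RECURSIVE r(n) AS (
|     SELECT 1
|   UNION ALL
|     SELECT n+1 FROM r WHERE n < 1000
| ) SELECT count(*) FROM r;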

-- 
Justin




Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2022-12-31 Thread Justin Pryzby
On Sat, Dec 31, 2022 at 02:26:08PM +0200, Maxim Boguk wrote:
> Hi,
> 
> When performing post-mortem analysis of some short latency spikes on a
> heavily loaded database, I found that the reason for (less than 10 second
> latency spike) wasn't on the EXECUTE stage but on the BIND stage.
> At the same time graphical monitoring shows that during this few second
> period there were some queries waiting in the BIND stage.
> 
> Logging setup:
> log_min_duration_statement=200ms
> log_lock_waits=on
> deadlock_timeout=100ms
> So I expected that every lock waiting over 100ms (>deadlock_timeout) should
> be in the log.
> But in the log I see only spikes on slow BIND but not lock waits logged.

What version postgres?  What settings have non-default values ?
What OS/version?  What environment/hardware?  VM/image/provider/...
What are the queries that are running BIND ?  What parameter types ?
Are the slow BINDs failing?  Are their parameters being logged ?
What else is running besides postgres ?  Are the DB clients local or
remote ?  It shouldn't matter, but what client library?

https://wiki.postgresql.org/wiki/Slow_Query_Questions

-- 
Justin




Re: Advice on best way to store a large amount of data in postgresql

2023-01-09 Thread Justin Pryzby
On Sun, Jan 08, 2023 at 07:02:01AM -0500, spiral wrote:
> This table is used essentially as a key-value store; rows are accessed
> only with `mid` primary key. Additionally, inserted rows may only be
> deleted, but never updated.
> 
> We only run the following queries:
> - INSERT INTO messages VALUES (...data...);
> - SELECT * FROM messages WHERE mid = $1;
> - DELETE FROM messages WHERE mid = $1;
> - DELETE FROM messages WHERE mid IN ($1...$n);
> - SELECT count(*) FROM messages;

Great - it's good to start with the queries to optimize.

Are you using the extended query protocol with "bind" parameters, or are they
escaped and substituted by the client library ?

> So, the problem: I don't know how to best store this data in
> postgres, or what system requirements would be needed.
> Originally, this table did not contain a substantial amount of data,
> and so I stored it in the same database as our CRUD user data. However,
> as the table became larger, cache was being allocated to (mostly
> unused) historical data from the `messages` table, and I decided to
> move the large table to its own postgres instance.
> 
> At the same time, I partitioned the table, with TimescaleDB's automatic
> time-series partitioning, because our data is essentially time-series
> (`mid` values are Twitter-style snowflakes) and it was said that
> partitioning would improve performance.
> This ended up being a mistake... shared_buffers memory usage went way
> up, from the 20GB of the previous combined database to 28GB for just
> the messages database, and trying to lower shared_buffers at all made
> the database start throwing "out of shared memory" errors when running
> DELETE queries. A TimescaleDB update did improve this, but 28GB is way
> more memory than I can afford to allocate to this database - instead of
> "out of shared memory", it gets OOM killed by the system.

Can you avoid using DELETE and instead use DROP ?  I mean, can you
arrange your partitioning such that the things to be dropped are all in
one partition, to handle in bulk ?  That's one of the main reasons for
using partitioning.
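
With native range partitions that would look like this (partition names
invented; timescale chunks have their own retention machinery):

| -- retention becomes a metadata-only operation instead of a huge DELETE:
| ALTER TABLE messages DETACH PARTITION messages_2022_01;
| DROP TABLE messages_2022_01;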

(Or, as a worse option, if you need to use DELETE, can you change the
query to DELETE one MID at a time, and loop over MIDs?)

What version of postgres is it ?  Ah, I found that you reported the same thing
in at least one other place.  (It'd be useful to include that information here,
as well as the prior discussion with the other product/vendor).

https://github.com/timescale/timescaledb/issues/5075

In this other issue report, you said that you increased
max_locks_per_transaction.  I suppose you need to increase it further,
or decrease your chunk size.  How many "partitions" do you have
(actually, timescale uses inheritance) ?

-- 
Justin




Re: Database Stalls

2023-01-30 Thread Justin Pryzby
On Mon, Jan 30, 2023 at 05:47:49PM +, Mok wrote:
> Hi,
> 
> We've started to observe instances of one of our databases stalling for a
> few seconds.
> 
> We see a spike in wal write locks then nothing for a few seconds. After
> which we have spike latency as processes waiting to get to the db can do
> so.
> 
> There is nothing in the postgres logs that give us any clues to what could
> be happening, no locks, unusually high/long running transactions, just a
> pause and resume.
> 
> Could anyone give me any advice as to what to look for when it comes to
> checking the underlying disk that the db is on?

What version postgres?  What settings have non-default values ?
What OS/version?  What environment/hardware?  VM/image/provider/...

Have you enabled logging for vacuum/checkpoints/locks ?

https://wiki.postgresql.org/wiki/Slow_Query_Questions




Re: Getting an index scan to be a parallel index scan

2023-02-01 Thread Justin Pryzby
On Wed, Feb 01, 2023 at 11:22:47AM -0800, Alex Kaiser wrote:
> I've never messed around with extended statistics, but I'm not sure how
> they would help here. From what I've read they seem to help when your query
> is restricting over multiple columns. Since this query is only on one
> column I'm not sure what a good "CREATE STATISTICS ..." command to run
> would be to improve the query plan. Any suggestions?

They wouldn't help.  It seems like that was a guess.

> As for how I found 'force_parallel_mode', I think I found it first here:
> https://postgrespro.com/list/thread-id/2574997 and then I also saw it when
> I was searching for 'parallel' on https://postgresqlco.nf .

Yeah.  force_parallel_mode is meant for debugging only, and we're
wondering how people end up trying to use it for other purposes.

http://rhaas.blogspot.com/2018/06/using-forceparallelmode-correctly.html

Did you try adjusting min_parallel_index_scan_size /
min_parallel_table_scan_size ?
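
For example (the values are only something to experiment with):

| SET min_parallel_index_scan_size = '64kB';  -- default is 512kB
| SET min_parallel_table_scan_size = '1MB';   -- default is 8MB

..and then re-run EXPLAIN to see whether a Parallel Index Scan is chosen.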

-- 
Justin




Re: BRIN index worse than sequential scan for large search set

2023-02-24 Thread Justin Pryzby
On Fri, Feb 24, 2023 at 05:40:55PM +0100, Mickael van der Beek wrote:
> Hello everyone,
> 
> I'm playing around with BRIN indexes so as to get a feel for the feature.
> During my tests, I was unable to make BRIN indexes perform better than a
> sequential scan for queries searching for large value sets (20K values in
> the example down below).

> And now let's query 20K random rows from our 20M total rows:

I didn't try your test, but I think *random* is the problem/explanation.

> By default, this query will not use the BRIN index and simply run a 1.5s
> long sequential scan (hitting 700 MB) and a 2.47s hash join for a total
> 8.7s query time:
> https://explain.dalibo.com/plan/46c3191g8a6c1bc7

> If we force the use of the BRIN index using (`SET LOCAL enable_seqscan =
> OFF;`) the same query will now take 50s with 2.5s spent on the bitmap index
> scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan
> (hitting 20 GB of data!):
> https://explain.dalibo.com/plan/7f73bg9172a8b226

That means the planner's cost model correctly preferred a seq scan.

> So I had the following two questions:
> 
>1. Why is the BRIN index systematically worse than a sequential scan,
>even when the table is x1000 larger than the search set, physically
>pre-sorted, dense (fillfactor at 100%) and the search rows are themselves
>sorted?

The table may be dense, but the tuples aren't.  You're asking to return
1/1000th of the tuples, across the entire table.  Suppose there are ~100
tuples per page, and you need to read about every 10th page.  It makes
sense that it's slow to read a large amount of data nonsequentially.
That's why random_page_cost is several times higher than seq_page_cost.

I would expect brin to win if the pages to be accessed were dense rather
than distributed across the whole table.

>2. Since we only select the "idx" column, why does the BRIN index not
>simply return the searched value if included in one of it's ranges?
>Hitting the actual row data stored in the table seems to be unnessary no?

Because it's necessary to check if the tuple is visible to the current
transaction.  It might be from an uncommited/aborted transaction.

-- 
Justin




Re: BRIN index worse than sequential scan for large search set

2023-02-24 Thread Justin Pryzby
On Fri, Feb 24, 2023 at 06:51:00PM +0100, Mickael van der Beek wrote:
> Hello Justin,
> 
> Thanks for the quick response!
> 
> > The table may be dense, but the tuples aren't.  You're asking to return
> > 1/1000th of the tuples, across the entire table.  Suppose there are ~100
> > tuples per page, and you need to read about every 10th page.  It makes
> > sense that it's slow to read a large amount of data nonsequentially.
> 
> Ah, of course, you're right!
> I forgot that the BRIN indexes store ranges that are not fully covered by
> the row values and that PostgreSQL has to double-check (bitmap heap scan)
> ...
> Would you thus advise to only use BRIN indexes for columns whose values are
> (1) monotonically increasing but also (2) close to each other?

It's not important whether they're "rigidly" monotonic (nor "strictly").
What's important is that a query doesn't need to access a large number
of pages.

For example, some of the BRIN indexes that I'm familiar with are created
on a column called "start time", but the table's data tends to be
naturally sorted by "end time" - and that's good enough.  If someone
queries for data between 12pm and 1pm, there's surely no data for the
first 12 hours of the day's table (because it hadn't happened yet) and
there's probably no data for the last 9+ hours of the day, either, so
it's only got to read data for a 1-2h interval in the middle.  This
assumes that the column's data is typically correlated.  If the tuples
aren't clustered/"close to each other" then it probably doesn't work
well.  I haven't played with brin "multi minmax", though.

> > >2. Since we only select the "idx" column, why does the BRIN index not
> > >simply return the searched value if included in one of it's ranges?
> > >Hitting the actual row data stored in the table seems to be unnessary 
> > > no?
> >
> > Because it's necessary to check if the tuple is visible to the current
> > transaction.  It might be from an uncommited/aborted transaction.

Actually, a better explanation is that all the brin scan returns is the page,
and not the tuples.

"BRIN indexes can satisfy queries via regular bitmap index scans, and
will return all tuples in all pages within each range if the summary
info stored by the index is CONSISTENT with the query conditions.  The
query executor is in charge of rechecking these tuples and discarding
those that do not match the query conditions — in other words, these
indexes are LOSSY".

The index is returning pages where matching tuples *might* be found,
after excluding those pages where it's certain that no tuples are found.

-- 
Justin




Re: multicolumn partitioning help

2023-03-14 Thread Justin Pryzby
On Sun, Mar 12, 2023 at 01:59:32PM -0400, James Robertson wrote:
> Hey folks,
> I am having issues with multicolumn partitioning. For reference I am using
> the following link as my guide:
> https://www.postgresql.org/docs/devel/sql-createtable.html
> 
> Reading the documentation:   "When creating a range partition, the lower
> bound specified with FROM is an inclusive bound, whereas the upper bound
> specified with TO is an exclusive bound".
> 
> However I can't insert any of the following after the first one, because it
> says it overlaps. Do I need to do anything different when defining
> multi-column partitions?

The bounds are compared like rows:

When creating a range partition, the lower bound specified with FROM is
an inclusive bound, whereas the upper bound specified with TO is an
exclusive bound. That is, the values specified in the FROM list are
valid values of the corresponding partition key columns for this
partition, whereas those in the TO list are not. Note that this
statement must be understood according to the rules of row-wise
comparison (Section 9.24.5). For example, given PARTITION BY RANGE
(x,y), a partition bound FROM (1, 2) TO (3, 4) allows x=1 with any y>=2,
x=2 with any non-null y, and x=3 with any y<4.

https://www.postgresql.org/docs/current/functions-comparisons.html#ROW-WISE-COMPARISON

> This works:
> CREATE TABLE humans_1968_0 PARTITION OF humans FOR VALUES FROM (1968, '0')
> TO (1969, '1');

This table is everything from 1968 (starting with '0') to 1969

> These fail:
> CREATE TABLE humans_1968_1 PARTITION OF humans FOR VALUES FROM (1968, '1')
> TO (1969, '2');

Which is why these are overlapping.

> CREATE TABLE humans_1969_1 PARTITION OF humans FOR VALUES FROM (1969, '1')
> TO (1970, '2');

This one doesn't fail, but it "occupies" / subjugates all of 1969
starting with 1.
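
For what it's worth, a sketch of bounds that don't overlap, keeping each
partition inside a single year (humans_1968_0 would need to be recreated with
the narrower upper bound):

| CREATE TABLE humans_1968_0 PARTITION OF humans FOR VALUES FROM (1968, '0') TO (1968, '1');
| CREATE TABLE humans_1968_1 PARTITION OF humans FOR VALUES FROM (1968, '1') TO (1968, '2');
| CREATE TABLE humans_1969_1 PARTITION OF humans FOR VALUES FROM (1969, '1') TO (1969, '2');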

-- 
Justin



