[PERFORM] CREATE STATISTICS and join selectivity

2017-11-15 Thread Justin Pryzby
I just noticed that PG10 CREATE STATISTICS (dependencies) doesn't seem to work
for joins on multiple columns; is that right?

With statistics on table for 2017 but not 20171110:

ts=# CREATE STATISTICS x ON site_id,sect_id FROM eric_enodeb_cell_2017;
ts=# ANALYZE VERBOSE eric_enodeb_cell_2017;

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171110 a JOIN 
eric_enodeb_cell_20171110 b USING(start_time, sect_id) WHERE a.site_id=318 AND 
sect_id=1489;
Nested Loop  (cost=0.83..4565.09 rows=1 width=0) (actual time=23.595..69.541 
rows=96 loops=1)
=> bad estimate on redundant WHERE WITHOUT multivar statistics

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_2017 a JOIN 
eric_enodeb_cell_2017 b USING(start_time, sect_id) WHERE a.site_id=318 AND 
sect_id=1489;
Nested Loop  (cost=0.83..4862.41 rows=96 width=0) (actual time=0.034..3.882 
rows=96 loops=1)
=> good estimate on redundant WHERE WITH multivar statistics

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171110 a JOIN 
eric_enodeb_cell_20171110 b USING(start_time, sect_id);
Merge Join  (cost=18249.85..19624.18 rows=54858 width=0) (actual 
time=157.252..236.945 rows=55050 loops=1)
=> good estimate on JOIN on SECT_id without stats

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171110 a JOIN 
eric_enodeb_cell_20171110 b USING(start_time, site_id);
Merge Join  (cost=0.83..14431.81 rows=261499 width=0) (actual 
time=0.031..259.382 rows=262638 loops=1)
=> good estimate on JOIN on SITE_id without stats

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_2017 a JOIN 
eric_enodeb_cell_2017 b USING(start_time, site_id);
Merge Join  (cost=0.83..14706.29 rows=268057 width=0) (actual 
time=37.360..331.276 rows=268092 loops=1)
=> good estimate on JOIN on SITE_id with stats

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_2017 a JOIN 
eric_enodeb_cell_2017 b USING(start_time, sect_id);
Merge Join  (cost=18560.89..19959.67 rows=55944 width=0) (actual 
time=130.865..198.439 rows=55956 loops=1)
=> good estimate on JOIN on SECT_id with stats

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_2017 a JOIN 
eric_enodeb_cell_2017 b USING(start_time, sect_id, site_id);
Gather  (cost=1000.83..1.06 rows=460 width=0) (actual time=1.686..149.707 
rows=55956 loops=1)
=> poor estimate on redundant JOIN WITH stats (??)

I've already fixed our reports to avoid this kind of thing and support our PG95
customers, but I tentatively would've expected PG10 MV stats to "know" that
USING(site_id, sect_id) is no more selective than USING(sect_id), same as it
knows that's true for WHERE site_id=... AND sect_id=...
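
For reference, the functional dependencies PG10 stores can be inspected in
pg_statistic_ext -- a rough sketch, assuming PG10's catalog layout and the
statistics object "x" created above:

ts=# SELECT stxname, stxkeys, stxdependencies FROM pg_statistic_ext WHERE stxname='x';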

Justin




Re: [PERFORM] query performance issue

2017-11-15 Thread Justin Pryzby
On Wed, Nov 15, 2017 at 03:03:39PM +0530, Samir Magar wrote:
> I am having performance issues with one of the queries.
> The query is taking 39 min to fetch 3.5 mil records.
> 
> I want to reduce that time to 15 mins.
> could you please suggest something to improve its performance?

> "HashAggregate  (cost=4459.68..4459.69 rows=1 width=27) (actual 
> time=2890035.403..2892173.601 rows=3489861 loops=1)"

Looks to me like the problem is here:

> "  ->  Index Only Scan using idxdq7 on dlr_qlfy  (cost=0.43..4.45 ROWS=1 
> width=16) (actual time=0.009..0.066 ROWS=121 loops=103987)"
> "Index Cond: ((qlfy_grp_id = dlr_grp.dlr_grp_id) AND (qlf_flg = 
> 'N'::bpchar) AND (cog_grp_id = dlr_grp_dlr_xref_1.dlr_grp_id))"
> "Heap Fetches: 0"

Returning 100x more rows than expected and bubbling up through a cascade of
nested loops.

Are those 3 conditions independent ?  Or, perhaps, are rows for which
"qlfy_grp_id=dlr_grp.dlr_grp_id" is true always going to have
"cog_grp_id = dlr_grp_dlr_xref_1.dlr_grp_id" ?

Even if it's not "always" true, if rows which pass the one condition are more
likely to pass the other condition, this will cause an underestimate, as
observed.

You can do an experiment SELECTing just from those two tables joined and see if
you can reproduce the problem with poor rowcount estimate (hopefully in much
less than 15min).

If you can't drop one of the two conditions, you can make PG treat it as a
single condition for purpose of determining expected selectivity, using a ROW()
comparison like:

ROW(qlfy_grp_id, cog_grp_id) = ROW(dlr_grp.dlr_grp_id, 
dlr_grp_dlr_xref_1.dlr_grp_id)
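
For example, roughly (a sketch only -- the aliases and join structure here are
assumed from the plan above, not from your actual query):

  ... JOIN dlr_qlfy ON ROW(dlr_qlfy.qlfy_grp_id, dlr_qlfy.cog_grp_id)
                     = ROW(dlr_grp.dlr_grp_id, dlr_grp_dlr_xref_1.dlr_grp_id)
                    AND dlr_qlfy.qlf_flg = 'N'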

If you're running PG96+ you may also be able to work around this by adding FKs.
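
Purely illustrative (whether these columns actually have a FK relationship is an
assumption on my part):

ALTER TABLE dlr_qlfy ADD FOREIGN KEY (qlfy_grp_id) REFERENCES dlr_grp (dlr_grp_id);

..since 9.6 the planner can use a matching FK to treat the join clauses it covers
as a single condition rather than multiplying their selectivities.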

Justin




Re: [PERFORM] overestimate on empty table

2017-11-11 Thread Justin Pryzby
On Fri, Nov 10, 2017 at 04:19:41PM -0500, Tom Lane wrote:
> Justin Pryzby <pry...@telsasoft.com> writes:
> > (or, the opposite of the more common problem)

> > As the queued_alters table is typically empty (and autoanalyzed with
> > relpages=0), I see "why":
> 
> > ./src/backend/optimizer/util/plancat.c
> > |if (curpages < 10 &&
> > |rel->rd_rel->relpages == 0 &&
> > |!rel->rd_rel->relhassubclass &&
> > |rel->rd_rel->relkind != RELKIND_INDEX)
> > |curpages = 10;
> 
> So I'm sure you read the comment above that, too.

> One idea is to say that relpages = reltuples = 0 is only the state that
> prevails for a freshly-created table, and that VACUUM or ANALYZE should
> always set relpages to at least 1 even if the physical size is zero.

> Dunno if that would confuse people.

What about adding && rel->rd_rel->reltuples==0, and making VACUUM/ANALYZE instead
set only reltuples=1, since that's already done at costsize.c: clamp_row_est(),
and therefore no additional confusion?

Justin




[PERFORM] overestimate on empty table

2017-11-10 Thread Justin Pryzby
(or, the opposite of the more common problem)

I wrote this query some time ago to handle "deferred" table-rewriting,
type-promoting ALTERs of inheritance children, to avoid the worst-case disk usage
of altering the whole hierarchy, and also locking the entire hierarchy against
SELECT and INSERT.

ts=# explain analyze SELECT child c, parent p, array_agg(colpar.attname::text) 
cols, array_agg(colpar.atttypid::regtype) AS types FROM
queued_alters qa JOIN pg_attribute colpar ON 
qa.parent::regclass=colpar.attrelid JOIN
pg_attribute colcld ON qa.child::regclass=colcld.attrelid WHERE
colcld.attname=colpar.attname AND colpar.atttypid!=colcld.atttypid GROUP BY 1,2
ORDER BY regexp_replace(child, 
'.*_((([0-9]{4}_[0-9]{2})_[0-9]{2})|(([0-9]{6})([0-9]{2})?))$', '\3\5') DESC, 
-- by MM
child~'_[0-9]{6}$' DESC, -- monthly tables first
regexp_replace(child, '.*_', '') DESC -- by MMDD
LIMIT 1;

Unfortunately we get this terrible plan:

Limit  (cost=337497.59..337497.60 rows=1 width=184) (actual 
time=2395.283..2395.283 rows=0 loops=1)
  ->  Sort  (cost=337497.59..337500.04 rows=980 width=184) (actual 
time=2395.281..2395.281 rows=0 loops=1)
Sort Key: (regexp_replace((qa.child)::text, 
'.*_((([0-9]{4}_[0-9]{2})_[0-9]{2})|(([0-9]{6})([0-9]{2})?))$'::text, 
'\3\5'::text)) DESC, (((qa.child)::text ~ '_[0-9]{6}$'::text)) DESC, 
(regexp_replace((qa.child)::text, '.*_'::text, ''::text)) DESC
Sort Method: quicksort  Memory: 25kB
->  HashAggregate  (cost=337470.64..337492.69 rows=980 width=184) 
(actual time=2395.273..2395.273 rows=0 loops=1)
  Group Key: qa.child, qa.parent
  ->  Gather  (cost=293727.20..336790.89 rows=54380 width=123) 
(actual time=2395.261..2395.261 rows=0 loops=1)
Workers Planned: 3
Workers Launched: 3
->  Hash Join  (cost=292727.20..330352.89 rows=17542 
width=123) (actual time=2341.328..2341.328 rows=0 loops=4)
  Hash Cond: ((((qa.child)::regclass)::oid = 
colcld.attrelid) AND (colpar.attname = colcld.attname))
  Join Filter: (colpar.atttypid <> colcld.atttypid)
  ->  Merge Join  (cost=144034.27..151009.09 
rows=105280 width=123) (actual time=514.820..514.820 rows=0 loops=4)
Merge Cond: (colpar.attrelid = 
(((qa.parent)::regclass)::oid))
->  Sort  (cost=143965.78..145676.59 
rows=684322 width=72) (actual time=514.790..514.790 rows=1 loops=4)
  Sort Key: colpar.attrelid
  Sort Method: external merge  Disk: 78448kB
  ->  Parallel Seq Scan on pg_attribute 
colpar  (cost=0.00..77640.22 rows=684322 width=72) (actual time=0.011..164.106 
rows=445582 loops=4)
->  Sort  (cost=68.49..70.94 rows=980 width=55) 
(actual time=0.031..0.031 rows=0 loops=3)
  Sort Key: (((qa.parent)::regclass)::oid)
  Sort Method: quicksort  Memory: 25kB
  ->  Seq Scan on queued_alters qa  
(cost=0.00..19.80 rows=980 width=55) (actual time=0.018..0.018 rows=0 loops=3)
  ->  Hash  (cost=92010.97..92010.97 rows=2121397 
width=72) (actual time=1786.056..1786.056 rows=1782330 loops=4)
Buckets: 2097152  Batches: 2  Memory Usage: 
106870kB
->  Seq Scan on pg_attribute colcld  
(cost=0.00..92010.97 rows=2121397 width=72) (actual time=0.027..731.554 
rows=1782330 loops=4)

As the queued_alters table is typically empty (and autoanalyzed with
relpages=0), I see "why":

./src/backend/optimizer/util/plancat.c
|if (curpages < 10 &&
|rel->rd_rel->relpages == 0 &&
|!rel->rd_rel->relhassubclass &&
|rel->rd_rel->relkind != RELKIND_INDEX)
|curpages = 10;


Indeed it works much better if I add a child table as a test/kludge:
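
(Roughly, the kludge amounts to something like this -- child name hypothetical:

ts=# CREATE TABLE queued_alters_kludge() INHERITS(queued_alters);

..which sets relhassubclass on the parent, so the 10-page floor above no longer
applies.)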

  ->  Sort  (cost=306322.49..306323.16 rows=271 width=403) (actual 
time=4.945..4.945 rows=0 loops=1)
Sort Key: (regexp_replace((qa.child)::text, 
'.*_((([0-9]{4}_[0-9]{2})_[0-9]{2})|(([0-9]{6})([0-9]{2})?))$'::text, 
'\3\5'::text)) DESC, (((qa.child)::text ~ '_[0-9]{6}$'::text)) DESC, 
(regexp_replace((qa.child)::text, '.*_'::text, ''::text)) DESC
Sort Method: quicksort  Memory: 25kB
->  GroupAggregate  (cost=306089.46..306321.13 rows=271 width=403) 
(actual time=4.938..4.938 rows=0 loops=1)
  Group Key: qa.child, qa.parent
  ->  Sort  (cost=306089.46..306127.06 rows=15038 width=342) 
(actual time=4.936..4.936 rows=0 loops=1)
Sort Key: qa.child, qa.parent
Sort Method: quicksort  Memory: 25kB
->  Gather  

Re: [PERFORM] Performance loss upgrading from 9.3 to 9.6

2017-11-06 Thread Justin Pryzby
On 11/6/17, 9:21 AM, "Justin Pryzby" <pry...@telsasoft.com> wrote:
> see if statistics improve:
> SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, 
> tablename, attname, n_distinct, array_length(most_common_vals,1) n_mcv
> FROM pg_stats WHERE tablename~'customers_customer' AND attname='id' GROUP 
> BY 1,2,3,4,5 ORDER BY 1

On Mon, Nov 06, 2017 at 09:12:01PM +, Adam Torres wrote:
> I changed the statistics on av.customer_id as suggested and the number
> returned by pg_stats went from 202,333 to 904,097.

Do you mean n_distinct ?  It'd be useful to see that query on pg_stats.  Also I
don't know that we've seen \d output for the tables (or at least the joined
columns) or the full query ?

> There are 11.2 million distinct customer_ids on the 14.8 million vehicle
> records.

If there are so many distinct ids, updating stats won't help the rowcount
estimate (and could even hurt) - it can only store 1 most-common-values.

Are there as many distinct values for cc.id ?

I would try to reproduce the rowcount problem with a minimal query:
explain analyze SELECT FROM av JOIN cc ON av.customer_id=cc.id; --WHERE 
cc.id<99;
Maybe the rows estimate is okay for some values and not for others, so maybe
you need to try various WHERE (with JOINs to additional tables if need be...but
without reimplementing the whole query).

I just noticed there are two conditions on dealer_id, one from table av and one
from table cc_1.  It seems likely those are co-related/non-independent
conditions..but postgres probably doesn't know that (unless you used PG96 FK
logic, or PG10 multi-variable stats). 

As a test, you could try dropping one of those conditions, or maybe a hacky
change like ROW(av.dealer_id, cc_1.dealer_id)=ROW('EC79', 'EC79'),
which postgres estimates as no more selective than a single equality test.  BTW
this is all from src/backend/utils/adt/selfuncs.c.

Justin




Re: [PERFORM] Performance loss upgrading from 9.3 to 9.6

2017-11-06 Thread Justin Pryzby
On Mon, Nov 06, 2017 at 01:18:00PM +, Adam Torres wrote:
> Good morning all,
> 
> We have a problem with performance after upgrading from 9.3 to 9.6 where 
> certain queries take 9 times longer to run.  On our initial attempt to 
> upgrade, we noticed the system as a whole was taking longer to run through 
> normal daily processes.  The query with the largest run time was picked to 
> act as a measuring stick.

> https://explain.depesz.com/s/z71u
> Planning time: 8.218 ms
> Execution time: 639319.525 ms
> 
> Same query as run on 9.3
> https://explain.depesz.com/s/gjN3
> Total runtime: 272897.150 ms

Actually it looks to me like both query plans are poor..

..because of this:
| Hash Join (cost=85,086.25..170,080.80 ROWS=40 width=115) (actual 
time=32.673..84.427 ROWS=13,390 loops=1)
|Hash Cond: (av.customer_id = cc_1.id)

If there are a large number of distinct customer_ids (maybe with nearly equal
frequencies), it might help to
ALTER TABLE av ALTER customer_id SET STATISTICS 400
..same for cc_1.id.  And re-analyze those tables (are they large??).
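
Concretely, something like this (av/cc are the aliases from the plan; substitute
the real table names):

ALTER TABLE av ALTER customer_id SET STATISTICS 400;
ALTER TABLE cc ALTER id SET STATISTICS 400;
ANALYZE av; ANALYZE cc;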

see if statistics improve:
SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, 
attname, n_distinct, array_length(most_common_vals,1) n_mcv
FROM pg_stats WHERE tablename~'customers_customer' AND attname='id' GROUP BY 
1,2,3,4,5 ORDER BY 1

Goal is to get at least an accurate value for n_distinct (but preferably also
storing the most frequent IDs).  I wouldn't bother re-running the query unless
you find that increasing stats target causes the plan to change.

Justin




Re: [PERFORM] performance drop after upgrade (9.6 > 10)

2017-10-26 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 04:15:59PM +0200, johannes graën wrote:
> Hi Pavel, *,
> 
> you were right with ANALYZing the DB first. However, even after doing
> so, I frequently see Seq Scans where an index was used before. This
> usually cooccurs with parallelization and looked different before
> upgrading to 10. I can provide an example for 10 [1], but I cannot
> generate a query plan for 9.6 anymore.
> 
> Any ideas what makes the new version more seqscanny?

Is it because max_parallel_workers_per_gather now defaults to 2 ?

BTW, I would tentatively expect a change in default to be documented in the
release notes but can't see that it is.
77cd477c4ba885cfa1ba67beaa82e06f2e182b85
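
One quick, session-local way to test that theory:

SET max_parallel_workers_per_gather = 0;
-- ..then re-run EXPLAIN (ANALYZE) on the affected query and see whether the
-- seq scans / Gather nodes go away.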

Justin




Re: [PERFORM] performance drop after upgrade (9.6 > 10)

2017-10-24 Thread Justin Pryzby
On Tue, Oct 24, 2017 at 04:15:59PM +0200, johannes graën wrote:
> upgrading to 10. I can provide an example for 10 [1], but I cannot
> generate a query plan for 9.6 anymore.

You could (re)install PG96 alongside PG10 and run a copy of the DB (even from
your homedir, or on a different server) and pg_dump |pg_restore the relevant
tables (just be sure to specify the alternate host/port/user/etc as needed for
the restore invocation).
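
Roughly like this (host/port/table names are placeholders):

pg_dump -Fc -h prodhost -t some_table dbname | pg_restore -h testhost -p 5433 -d dbname --no-owner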

Justin




Re: [PERFORM] Partitioned table - scans through every partitions

2017-08-25 Thread Justin Pryzby
On Fri, Aug 25, 2017 at 03:36:29PM +, Aniko Belim wrote:
> Hi,
> 
> We have an issue with one of our partitioned tables. It has a column with 
> timestamp without time zone type, and we had to partition it daily. To do 
> that, we created the following constraints like this example:
> CHECK (to_char(impression_time, 'YYYYMMDD'::text) = '20170202'::text)
> 
> 
> The problem we’re facing is no matter how we’re trying to select from it, it 
> scans through every partition.


> It scans through every partition. Shouldn’t it only scan the 
> dfp_in_network_impressions.dfp_in_network_impressions_20170202 child table? 
> Or are we missing something?
> Any advice/help would be highly appreciated.

https://www.postgresql.org/docs/9.6/static/ddl-partitioning.html#DDL-PARTITIONING-CAVEATS
|The following caveats apply to constraint exclusion:
|Constraint exclusion only works when the query's WHERE clause contains
|constants (or externally supplied parameters). For example, a comparison
|against a non-immutable function such as CURRENT_TIMESTAMP cannot be
|optimized, since the planner cannot know which partition the function value
|might fall into at run time.

...
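
A CHECK written as a plain range comparison on the column (rather than a function
of it) is something constraint exclusion can use -- a rough sketch:

CHECK (impression_time >= '2017-02-02' AND impression_time < '2017-02-03')

..with queries likewise comparing impression_time directly against constants.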




Re: [PERFORM] Performance Issue -- "Materialize"

2017-08-21 Thread Justin Pryzby
On Sat, Aug 19, 2017 at 10:37:56AM -0700, anand086 wrote:
>
> QUERY PLAN
> 
> Aggregate  (cost=351405.08..351405.09 rows=1 width=8)
>

Would you send explain ANALYZE and not just explain ?

Justin




Re: [PERFORM] Very poor read performance, query independent

2017-07-18 Thread Justin Pryzby
On Tue, Jul 18, 2017 at 02:13:58PM -0300, Claudio Freire wrote:
> On Tue, Jul 18, 2017 at 1:01 PM, Claudio Freire  
> wrote:
> > On Tue, Jul 18, 2017 at 6:20 AM, Charles Nadeau
> >  wrote:
> >> Claudio,
> >>
> >> At one moment
> >> during the query, there is a write storm to the swap drive (a bit like this
> >> case:
> >> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).
> >> I can hardly explain it as there is plenty of memory on this server.
> >
> > That sounds a lot like NUMA zone_reclaim issues:
> >
> > https://www.postgresql.org/message-id/500616cb.3070...@2ndquadrant.com
> 
> I realize you have zone_reclaim_mode set to 0. Still, the symptoms are
> eerily similar.

Did you look at disabling KSM and/or THP ?

sudo sh -c 'echo 2 >/sys/kernel/mm/ksm/run'
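
(The THP equivalent is typically along these lines, though the exact sysfs path
varies by distro/kernel:)

sudo sh -c 'echo never >/sys/kernel/mm/transparent_hugepage/enabled'
sudo sh -c 'echo never >/sys/kernel/mm/transparent_hugepage/defrag'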

https://www.postgresql.org/message-id/20170524155855.GH31097%40telsasoft.com
https://www.postgresql.org/message-id/CANQNgOrD02f8mR3Y8Pi=zfsol14rqnqa8hwz1r4rsndlr1b...@mail.gmail.com
https://www.postgresql.org/message-id/CAHyXU0y9hviyKWvQZxX5UWfH9M2LYvwvAOPQ_DUPva2b71t12g%40mail.gmail.com
https://www.postgresql.org/message-id/20130716195834.8fe5c79249cb2ff0d4270...@yahoo.es
https://www.postgresql.org/message-id/CAE_gQfW3dBiELcOppYN6v%3D8%2B%2BpEeywD7iXGw-OT3doB8SXO4_A%40mail.gmail.com
https://www.postgresql.org/message-id/flat/1436268563235-5856914.post%40n5.nabble.com#1436268563235-5856914.p...@n5.nabble.com
https://www.postgresql.org/message-id/cal_0b1tjozcx3lo3eve1rqgat%2bjj_q7w4pkj87wfwwxbtug...@mail.gmail.com
https://www.postgresql.org/message-id/556e2068.7070...@vuole.me
https://www.postgresql.org/message-id/1415981309.90631.YahooMailNeo%40web133205.mail.ir2.yahoo.com
https://www.postgresql.org/message-id/CAHyXU0yXYpCXN4%3D81ZDRQu-oGzrcq2qNAXDpyz4oiQPPAGk4ew%40mail.gmail.com
https://www.pythian.com/blog/performance-tuning-hugepages-in-linux/
http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/

Justin




estimate correlation of index separately from table (Re: [PERFORM] index fragmentation on insert-only table with non-unique column)

2017-07-07 Thread Justin Pryzby
Months ago I reported an issue with very slow index scans of tables with high
"correlation" of their indexed column, due to (we concluded at the time)
duplicate/repeated values of that column causing many lseek()s.  References:

https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.ga11...@telsasoft.com
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520d6610.8040...@emulex.com
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou...@4ax.com

My interim workarounds were a reindex job and daily granularity partitions
(despite having an excessive number of child tables) to limit query execution time
and to encourage seq scans for daily analysis jobs rather than idx scans.  I've now
cobbled together a patch to throw around and see if we can improve on that.  I
tried several changes, hoping to discourage index scan.  The logic that seems
to most accurately reflect costs has two changes:  

Postgres' existing behavior stores table correlation (heap value vs. position)
but doesn't look at index correlation, so can't distinguish between a just-built
index, and a highly fragmented index, or one with highly-nonsequential TIDs.
My patch causes ANALYZE to do a TID scan (sampled across the MCVs) to determine
correlation of heap TID vs index tuple logical location (as opposed to the
table correlation, computed as: heap TID vs. heap value).
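
For reference, the table correlation in question is the value visible in
pg_stats (names hypothetical):

ts=# SELECT tablename, attname, correlation FROM pg_stats
     WHERE tablename='some_table' AND attname='some_column';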

The second change averages separate correlation values of small lists of 300
consecutive TIDs, rather than the coarse-granularity/aggregate correlation of a
small sample of pages across the entire index.  Postgres' existing sampling is
designed to give an even sample across all rows.  An issue with this
coarse-granularity correlation is that the index can have a broad correlation
(to physical heap location), but with many small-scale deviations, which don't
show up due to sampling a relatively small fraction of a large table; and/or
the effect of the deviations is insignificant/noise and correlation is still
computed near 1.

I believe the "large scale" correlation that postgres computes from block
sample fails to well represent small-scale uncorrelated reads which contribute
large number of random reads, but not included in planner cost.

Not only are the index reads highly random (which the planner already assumes),
but the CTIDs referenced within a given btree leaf page are also substantially
non-sequential.  It seems to affect INSERTs which, over a short interval of
time have a small-moderate range of column values, but which still have a
strong overall trend in that column WRT time (and heap location).  Think of
inserting a "normal" distribution of timestamps centered around now().

My original report shows lseek() for nearly every read on the *heap*.  The
original problem was on a table of telecommunications CDRs, indexed by "record
opening time" (start time) and with high correlation value in pg_stats.  We
import data records from a file, which is probably more or less in order of
"end time".

That still displays broad correlation on "start time", but individual TIDs are
nowhere near sequential.  Two phone calls which end in the same 1 second
interval are not unlikely to have started many minutes apart...  but two calls
which end within the same second are very likely to have started within an hour
of each other.. since typical duration is <1h.  But, insert volume is high, and
there are substantial repeated keys, so the portion of an index scan returning
CTIDs for some certain key value (timestamp with second resolution in our case)
ends up reading heap tuples for a non-negligible fraction of the table: maybe
only 1%, but that's 300 seeks across 1% of a table which is 10s of GB ...
which is still 300 seeks, and takes long enough that cache is inadequate to
substantially mitigate the cost.

It's not clear to me that there's a good way to evenly *sample* a fraction of
the index blocks in a manner which is agnostic to different AMs.  Scanning the
entirety of all indices on relations during (auto) analyze may be too
expensive.  So I implemented index scan of the MCV list.  I'm guessing this
might cause the correlation to be under-estimated, and prefer bitmap scans
somewhat more than justified, due to btree insertion logic for repeated keys,
to reduce O(n^2) behavior.  MCV list isn't perfect since that can happen with
eg. normally distributed floating point values (or timestamp with fractional
seconds).

I ran pageinspect on a recent child of the table that triggered the original
report:

ts=# SELECT itemoffset, ctid FROM 
bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx',6) LIMIT 
22 OFFSET 1;
 itemoffset |  ctid  
+
  2 | (81,4)
  3 | (5,6)
  4 | (6,5)
  5 | (9,1)
  6 | (10,1)
  7 | (14,1)
  8 | (21,8)
  9 | (25,1)
 10 | (30,5)
 11 | 

Re: [PERFORM] Re: join under-estimates with ineq conditions

2017-06-15 Thread Justin Pryzby
I never heard back but was hoping for some feedback/discussion about this 2nd
problem/patch.

just a reminder - Thanks

On Thu, Jun 08, 2017 at 11:05:38AM -0500, Justin Pryzby wrote:
> On Mon, Jun 05, 2017 at 05:02:32PM -0400, Tom Lane wrote:
> > Justin Pryzby <pry...@telsasoft.com> writes:
> > > diff --git a/src/backend/utils/adt/selfuncs.c 
> > > b/src/backend/utils/adt/selfuncs.c
> > > +   if (nd1>vardata1->rel->rows) nd1=vardata1->rel->rows;
> > > +   if (nd2>vardata1->rel->rows) nd2=vardata2->rel->rows;
> > 
> > I don't like this change too much.
> 
> Thanks for your analysis ;)
> 
> I have a 2nd patch which improves the 2nd case I mentioned..
> 
> > I note for instance that this patch would do nothing at all for the toy
> 
> >> There's still a 2nd issue which this doesn't address, having to do with 
> >> joins
> >> of tables with full/complete MCV lists, and selective queries on those 
> >> tables,
> >> as demonstrated by the artificial test:
> >>
> >> > postgres=# CREATE TABLE t(i INT);
> >> > postgres=# TRUNCATE t;INSERT INTO t SELECT i FROM generate_series(1,99) 
> >> > i,generate_series(1,99);ANALYZE t;
> >> > postgres=# SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) 
> >> > frac_MCV, tablename, attname, n_distinct, 
> >> > array_length(most_common_vals,1) n_mcv, array_length(histogram_bounds,1) 
> >> > n_hist, (SELECT MAX(x) FROM unnest(most_common_vals::text::text[]) x) 
> >> > maxmcv, 
> >> > (histogram_bounds::text::text[])[array_length(histogram_bounds,1)] 
> >> > maxhist FROM pg_stats WHERE attname~'i' AND tablename='t' GROUP BY 
> >> > 1,2,3,4,5,6,7,8 ORDER BY 1 DESC;
> 
> I pointed out that there were two issues, both involving underestimates from
> querying a fraction of a table using inequality condition.  One due to join
> estimate based on "nd" (and not substantially based on MCV), and one due to
> frequencies associated with MCV list (and not substantially falling back to
> estimate from "nd").
> 
> I made another patch to address the 2nd issue, which affects our 
> pre-aggregated
> tables (which are partitioned by month, same as the raw tables).  The
> aggregated tables are the result of something like SELECT start_time::date, 
> k1,
> k2, ..., sum(a), avg(b) ... GROUP BY 1,2,3, so have many fewer rows, and nd 
> for
> start_time::date column would be at most 31, so MCV list would be expected to
> be complete, same as the "toy" example I gave.
> 
> Sometimes when we query the aggregated tables for a small number of days we 
> get
> underestimate leading to nested loops..
> 
> Without patch:
> Merge Join  (cost=339.59..341.57 rows=99 width=4) (actual time=10.190..17.430 
> rows=9801 loops=1)
> 
> With patch:
> DEBUG:  ndfactor 99.00 99.00
> DEBUG:  nmatches 99 matchprodfreq 1.00
> DEBUG:  nmatches 99 matchprodfreq 1.00
> DEBUG:  matchfreq1 99.00 unmatchfreq1 0.00
> DEBUG:  matchfreq1 1.00 unmatchfreq1 0.00
> DEBUG:  matchfreq2 99.00 unmatchfreq2 0.00
> DEBUG:  matchfreq2 1.00 unmatchfreq2 0.00
> DEBUG:  otherfreq1 0.00 otherfreq2 0.00
> DEBUG:  select(1) 1.00
>  Hash Join  (cost=167.75..444.77 rows=9801 width=4) (actual 
> time=4.706..13.892 rows=9801 loops=1)
> 
> 
> diff --git a/src/backend/utils/adt/selfuncs.c 
> b/src/backend/utils/adt/selfuncs.c
> index 6a4f7b1..bc88423 100644
> --- a/src/backend/utils/adt/selfuncs.c
> +++ b/src/backend/utils/adt/selfuncs.c
> @@ -2279,6 +2279,14 @@ eqjoinsel_inner(Oid operator,
>  
>   nd1 = get_variable_numdistinct(vardata1, &isdefault1);
>   nd2 = get_variable_numdistinct(vardata2, &isdefault2);
> + float ndfactor1=1;
> + float ndfactor2=1;
> + if (vardata1->rel->rows)
> + ndfactor1=vardata1->rel->tuples / vardata1->rel->rows;
> + if (vardata2->rel->rows)
> + ndfactor2=vardata2->rel->tuples / vardata2->rel->rows;
> + // ndfactor1=ndfactor2=1;
> + elog(DEBUG4, "ndfactor %lf %lf", ndfactor1,ndfactor2);
>  
>   opfuncoid = get_opcode(operator);
>  
> @@ -2375,7 +2383,19 @@ eqjoinsel_inner(Oid operator,
>   }
>   }
>   }
> +
> + // you might think we should multiple by ndfactor1*ndfactor2,
> + // but that gives serious overestimates...
> + // matchprodfreq*= ndfactor1>ndfactor2?ndfactor1:ndfactor2;
> + // matchprodfreq*=ndfactor1;
> + 

Re: [PERFORM] Re: join under-estimates with ineq conditions

2017-06-08 Thread Justin Pryzby
On Mon, Jun 05, 2017 at 05:02:32PM -0400, Tom Lane wrote:
> Justin Pryzby <pry...@telsasoft.com> writes:
> > diff --git a/src/backend/utils/adt/selfuncs.c 
> > b/src/backend/utils/adt/selfuncs.c
> > +   if (nd1>vardata1->rel->rows) nd1=vardata1->rel->rows;
> > +   if (nd2>vardata1->rel->rows) nd2=vardata2->rel->rows;
> 
> I don't like this change too much.

Thanks for your analysis ;)

I have a 2nd patch which improves the 2nd case I mentioned..

> I note for instance that this patch would do nothing at all for the toy

>> There's still a 2nd issue which this doesn't address, having to do with 
>> joins
>> of tables with full/complete MCV lists, and selective queries on those 
>> tables,
>> as demonstrated by the artificial test:
>>
>> > postgres=# CREATE TABLE t(i INT);
>> > postgres=# TRUNCATE t;INSERT INTO t SELECT i FROM generate_series(1,99) 
>> > i,generate_series(1,99);ANALYZE t;
>> > postgres=# SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) 
>> > frac_MCV, tablename, attname, n_distinct, array_length(most_common_vals,1) 
>> > n_mcv, array_length(histogram_bounds,1) n_hist, (SELECT MAX(x) FROM 
>> > unnest(most_common_vals::text::text[]) x) maxmcv, 
>> > (histogram_bounds::text::text[])[array_length(histogram_bounds,1)] maxhist 
>> > FROM pg_stats WHERE attname~'i' AND tablename='t' GROUP BY 1,2,3,4,5,6,7,8 
>> > ORDER BY 1 DESC;

I pointed out that there were two issues, both involving underestimates from
querying a fraction of a table using an inequality condition.  One due to join
estimate based on "nd" (and not substantially based on MCV), and one due to
frequencies associated with MCV list (and not substantially falling back to
estimate from "nd").

I made another patch to address the 2nd issue, which affects our pre-aggregated
tables (which are partitioned by month, same as the raw tables).  The
aggregated tables are the result of something like SELECT start_time::date, k1,
k2, ..., sum(a), avg(b) ... GROUP BY 1,2,3, so have many fewer rows, and nd for
start_time::date column would be at most 31, so MCV list would be expected to
be complete, same as the "toy" example I gave.

Sometimes when we query the aggregated tables for a small number of days we get
an underestimate leading to nested loops..

Without patch:
Merge Join  (cost=339.59..341.57 rows=99 width=4) (actual time=10.190..17.430 
rows=9801 loops=1)

With patch:
DEBUG:  ndfactor 99.00 99.00
DEBUG:  nmatches 99 matchprodfreq 1.00
DEBUG:  nmatches 99 matchprodfreq 1.00
DEBUG:  matchfreq1 99.00 unmatchfreq1 0.00
DEBUG:  matchfreq1 1.00 unmatchfreq1 0.00
DEBUG:  matchfreq2 99.00 unmatchfreq2 0.00
DEBUG:  matchfreq2 1.00 unmatchfreq2 0.00
DEBUG:  otherfreq1 0.00 otherfreq2 0.00
DEBUG:  select(1) 1.00
 Hash Join  (cost=167.75..444.77 rows=9801 width=4) (actual time=4.706..13.892 
rows=9801 loops=1)


diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 6a4f7b1..bc88423 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2279,6 +2279,14 @@ eqjoinsel_inner(Oid operator,
 
nd1 = get_variable_numdistinct(vardata1, &isdefault1);
nd2 = get_variable_numdistinct(vardata2, &isdefault2);
+   float ndfactor1=1;
+   float ndfactor2=1;
+   if (vardata1->rel->rows)
+   ndfactor1=vardata1->rel->tuples / vardata1->rel->rows;
+   if (vardata2->rel->rows)
+   ndfactor2=vardata2->rel->tuples / vardata2->rel->rows;
+   // ndfactor1=ndfactor2=1;
+   elog(DEBUG4, "ndfactor %lf %lf", ndfactor1,ndfactor2);
 
opfuncoid = get_opcode(operator);
 
@@ -2375,7 +2383,19 @@ eqjoinsel_inner(Oid operator,
}
}
}
+
+   // you might think we should multiple by ndfactor1*ndfactor2,
+   // but that gives serious overestimates...
+   // matchprodfreq*= ndfactor1>ndfactor2?ndfactor1:ndfactor2;
+   // matchprodfreq*=ndfactor1;
+   // matchprodfreq*=ndfactor2;
+   // matchprodfreq*= ndfactor1<ndfactor2?ndfactor1:ndfactor2;
+   matchprodfreq*= ndfactor1<ndfactor2?ndfactor1:ndfactor2;
+
+   elog(DEBUG4, "nmatches %d matchprodfreq %lf", nmatches, 
matchprodfreq);
CLAMP_PROBABILITY(matchprodfreq);
+   elog(DEBUG4, "nmatches %d matchprodfreq %lf", nmatches, 
matchprodfreq);
+
/* Sum up frequencies of matched and unmatched MCVs */
matchfreq1 = unmatchfreq1 = 0.0;
for (i = 0; i < nvalues1; i++)
@@ -2385,8 +2405,14 @@ eqjoinsel_inner(Oid operator,
 

[PERFORM] Re: join estimate of subqueries with range conditions and constraint exclusion

2017-05-30 Thread Justin Pryzby
On Wed, May 24, 2017 at 04:17:30PM -0500, Justin Pryzby wrote:
> We got bitten again by what appears to be the same issue I reported (perhaps
> poorly) here:
> https://www.postgresql.org/message-id/20170326193344.GS31628%40telsasoft.com

> I'm diagnosing a bad estimate/plan due to excessively high n_distinct leading
> to underestimated rowcount when selecting from a small fraction of the table
> hierarchy.  This leads intermittently to bad things, specifically a cascade of
> misestimates and associated nested loops around millions of rows.

I dug into this some more;  I can mitigate the issue with this change:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 6a4f7b1..962a5b4 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2279,6 +2279,22 @@ eqjoinsel_inner(Oid operator,
 
nd1 = get_variable_numdistinct(vardata1, &isdefault1);
nd2 = get_variable_numdistinct(vardata2, &isdefault2);
+   elog(DEBUG4, "nd %lf %lf", nd1 ,nd2);
+   if (nd1>vardata1->rel->rows) nd1=vardata1->rel->rows;
+   if (nd2>vardata1->rel->rows) nd2=vardata2->rel->rows;
+
+   elog(DEBUG4, "nd %lf %lf", nd1 ,nd2);
+   elog(DEBUG4, "rows %lf %lf", vardata1->rel->rows ,vardata2->rel->rows);
+   elog(DEBUG4, "tuples %lf %lf", vardata1->rel->tuples 
,vardata2->rel->tuples);

original estimate:

DEBUG:  nd 35206.00 35206.00
DEBUG:  nd 35206.00 35206.00
DEBUG:  rows 5031.00 5031.00
DEBUG:  tuples 5031.00 5031.00
   
QUERY PLAN

 Hash Join  (cost=1294.56..2558.62 rows=723 width=750) (actual 
time=103.273..490.984 rows=50300 loops=1)
   Hash Cond: (eric_enodeb_metrics.start_time = 
eric_enodeb_metrics_1.start_time)

patched estimate/plan:

postgres=# explain ANALYZE SELECT * FROM (SELECT * FROM eric_enodeb_metrics 
WHERE start_time>'2017-04-25 18:00') x JOIN (SELECT * FROM eric_enodeb_metrics 
WHERE start_time>'2017-04-25 18:00') y USING (start_time);

DEBUG:  nd 35206.00 35206.00
DEBUG:  nd 5031.00 5031.00
DEBUG:  rows 5031.00 5031.00
DEBUG:  tuples 5031.00 5031.00

| Hash Join  (cost=1294.56..2602.14 rows=5075 width=750) (actual 
time=90.445..477.712 rows=50300 loops=1)
|   Hash Cond: (eric_enodeb_metrics.start_time = 
eric_enodeb_metrics_1.start_time)
|   ->  Append  (cost=0.00..1231.67 rows=5031 width=379) (actual 
time=16.424..46.899 rows=5030 loops=1)
| ->  Seq Scan on eric_enodeb_metrics  (cost=0.00..0.00 rows=1 
width=378) (actual time=0.012..0.012 rows=0 loops=1)
|   Filter: (start_time > '2017-04-25 18:00:00-05'::timestamp with 
time zone)
| ->  Seq Scan on eric_enodeb_201704  (cost=0.00..1231.67 rows=5030 
width=379) (actual time=16.408..45.634 rows=5030 loops=1)
|   Filter: (start_time > '2017-04-25 18:00:00-05'::timestamp with 
time zone)
|   Rows Removed by Filter: 23744
|   ->  Hash  (cost=1231.67..1231.67 rows=5031 width=379) (actual 
time=73.801..73.801 rows=5030 loops=1)
| Buckets: 8192  Batches: 1  Memory Usage: 1283kB
| ->  Append  (cost=0.00..1231.67 rows=5031 width=379) (actual 
time=14.607..47.395 rows=5030 loops=1)
|   ->  Seq Scan on eric_enodeb_metrics eric_enodeb_metrics_1  
(cost=0.00..0.00 rows=1 width=378) (actual time=0.009..0.009 rows=0 loops=1)
| Filter: (start_time > '2017-04-25 18:00:00-05'::timestamp 
with time zone)
|   ->  Seq Scan on eric_enodeb_201704 eric_enodeb_201704_1  
(cost=0.00..1231.67 rows=5030 width=379) (actual time=14.594..46.091 rows=5030 
loops=1)
| Filter: (start_time > '2017-04-25 18:00:00-05'::timestamp 
with time zone)
| Rows Removed by Filter: 23744

.. which gets additionally extreme with an increasingly restrictive condition, as
the rows estimate diverges further from nd.

There's still a 2nd issue which this doesn't address, having to do with joins
of tables with full/complete MCV lists, and selective queries on those tables,
as demonstrated by the artificial test:

> postgres=# CREATE TABLE t(i INT);
> postgres=# TRUNCATE t;INSERT INTO t SELECT i FROM generate_series(1,99) 
> i,generate_series(1,99);ANALYZE t;
> postgres=# SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, 
> tablename, attname, n_distinct, array_length(most_common_vals,1) n_mcv, 
> array_length(histogram_bounds,1) n_hist, (SELECT MAX(x) FROM 
> unnest(most_common_vals::text::text[]) x) maxmcv, 
> (histogram_bounds::text::text[])[array_length(histogram_bounds,1)] maxhist 
> FROM pg_stats WHERE attname

[PERFORM] join estimate of subqueries with range conditions and constraint exclusion

2017-05-24 Thread Justin Pryzby
We got bitten again by what appears to be the same issue I reported (perhaps
poorly) here:
https://www.postgresql.org/message-id/20170326193344.GS31628%40telsasoft.com

We have PG9.6.3 table hierarchies partitioned by time.  Our reports use
subqueries, each with its own copy of the range clauses on the time column, as
needed to get constraint exclusion; reference:
https://www.postgresql.org/message-id/25076.1366321335%40sss.pgh.pa.us

SELECT * FROM
(SELECT * FROM t WHERE col>const) a JOIN
(SELECT * FROM t WHERE col>const) b USING (col)

I'm diagnosing a bad estimate/plan due to excessively high n_distinct leading
to underestimated rowcount when selecting from a small fraction of the table
hierarchy.  This leads intermittently to bad things, specifically a cascade of
misestimates and associated nested loops around millions of rows.

Artificial/generated/contrived test case, involving table with 99 instances
each of 99 values:

postgres=# CREATE TABLE t(i INT);
postgres=# TRUNCATE t;INSERT INTO t SELECT i FROM generate_series(1,99) 
i,generate_series(1,99);ANALYZE t;
postgres=# SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, 
tablename, attname, n_distinct, array_length(most_common_vals,1) n_mcv, 
array_length(histogram_bounds,1) n_hist, (SELECT MAX(x) FROM 
unnest(most_common_vals::text::text[]) x) maxmcv, 
(histogram_bounds::text::text[])[array_length(histogram_bounds,1)] maxhist FROM 
pg_stats WHERE attname~'i' AND tablename='t' GROUP BY 1,2,3,4,5,6,7,8 ORDER BY 
1 DESC;
-[ RECORD 1 ]--
frac_mcv   | 1
tablename  | t
attname| i
n_distinct | 99
n_mcv  | 99
n_hist |
maxmcv | 99
maxhist|

range query (which could use constraint exclusion), but bad estimate:
postgres=# explain ANALYZE SELECT * FROM (SELECT * FROM t WHERE i<2) AS a JOIN 
(SELECT * FROM t WHERE i<2) AS b USING (i);
 Merge Join  (cost=339.59..341.57 rows=99 width=4) (actual time=8.272..16.892 
rows=9801 loops=1)

range query which could NOT use constraint exclusion, good estimate:
postgres=# explain ANALYZE SELECT * FROM (SELECT * FROM t) AS a JOIN (SELECT * 
FROM t) AS b USING (i) WHERE i<2;
 Hash Join  (cost=264.52..541.54 rows=9801 width=4) (actual time=12.688..22.325 
rows=9801 loops=1)

non-range query, good estimate:
postgres=# explain ANALYZE SELECT * FROM (SELECT * FROM t WHERE i=3) AS a JOIN 
(SELECT * FROM t WHERE i=3) AS b USING (i);
 Nested Loop  (cost=0.00..455.78 rows=9801 width=4) (actual time=0.482..15.820 
rows=9801 loops=1)

My understanding:
Postgres estimates join selectivity using the number of distinct values of the
underlying table.  For the subqueries "a" and "b", the estimate is the same as for
the underlying table "t", even when selecting only a small fraction of the table...
This is adt/selfuncs:eqjoinsel_inner().

Note, in my tests, report queries on the child table have correct estimates;
and, queries with only a "push down" WHERE clause outside the subquery have a
correct estimate (but not constraint exclusion), apparently due to
calc_joinrel_size_estimate() returning the size of the parent table, planning
a join without the restriction clause, followed by filtering the join result, at
which point I guess the MCV list becomes useful and the estimate is perfect..

SELECT * FROM
(SELECT * FROM t)a JOIN(SELECT * FROM t)b
USING (col) WHERE col>const

So my original question is basically still open ... is it possible to get
both good estimates/plans AND constraint exclusion ??

Thanks
Justin




Re: [PERFORM] Performance issue in PostgreSQL server...

2017-04-27 Thread Justin Pryzby
On Mon, Mar 06, 2017 at 12:17:22PM +, Dinesh Chandra 12108 wrote:
> Below is the output of Query SELECT * FROM pg_stats WHERE tablename='point' 
> AND attname='domain_class_id' ;
> 
> 
> schemaname | tablename | attname | inherited | null_frac | avg_width 
> | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | 
> correlation
> 
> "evidence"|"point"|"domain_class_id"|f|0|8|10|"{7,9,2,11,43,3,1,10,4,17}"|"{0.9322,0.0451333,0.0145,0.0039,0.0018,0.00146667,0.0005,0.0003,6.7e-05,6.7e-05}"|""|0.889078

On Fri, Mar 03, 2017 at 12:44:07PM +, Dinesh Chandra 12108 wrote:
>->  Index Scan using point_domain_class_id_index on point p  
> (cost=0.00..1483472.70 rows=1454751 width=16) (actual time=27.265..142101.1 
> 59 rows=1607491 loops=1)

On Sun, Mar 05, 2017 at 08:23:08PM -0800, Jeff Janes wrote:
> Why wouldn't this be using a bitmap scan rather than a regular index scan?
> It seems like it should prefer the bitmap scan, unless the table is well
> clustered on domain_class_id.  In which case, why isn't it just faster?

I missed your response until now, and can't see that anybody else responded,
but I suspect the issue is that the *table* is highly correlated WRT this
column, but the index may not be, probably due to duplicated index keys.
postgres only stores statistics on expression indices, and falls back to the
correlation of the table column for simple indices.

If you're still fighting this, would you send result of:

SELECT domain_class_id, count(1) FROM point GROUP BY 1 ORDER BY 2 DESC LIMIT 22;
or,
SELECT count(1) FROM point GROUP BY domain_class_id ORDER BY 1 DESC LIMIT 22;

if there's much repetition in the index keys, then PG's planner thinks an index
scan has low random_page_cost, and effective_cache_size has little effect on
large tables, and it never uses bitmap scan, which blows up if the index is
fragmented and has duplicate keys.  The table reads end up costing something
like 1454751*random_page_cost nonsequential reads and fseek() calls when it
thinks it'll cost only 1454751*16*seq_page_cost.

Is the query much faster if you first reindex point_domain_class_id_index ?

This has come up before, see:
> https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520d6610.8040...@emulex.com
> https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.ga11...@telsasoft.com
> https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou...@4ax.com

Justin




[PERFORM] self join estimate and constraint exclusion

2017-04-14 Thread Justin Pryzby
I mailed last month [0] but didn't see any response .. (if I'm being naive,
daft, or missing something simple, please just say so).

[0] https://www.postgresql.org/message-id/20170326193344.GS31628%40telsasoft.com

It seems when self (inner/equi) joining there are two bad alternatives: either
specify a where clause for each self-joined table and incur poor estimate and
plan, due to incorrect perceived independence of clauses, even though joined
column(s) could/ought to be known equal; or, specify where clause only once,
and incur cost of joining across all partitions, due to no constraint exclusion
on one (or more) self-joined table hierarchies.

-- Specify WHERE for each table causes bad underestimate:
|ts=# explain analyze SELECT * FROM eric_enodeb_metrics a JOIN 
eric_enodeb_metrics b USING (start_time, site_id) WHERE 
a.start_time>='2017-03-19' AND a.start_time<'2017-03-20' AND 
b.start_time>='2017-03-19' AND b.start_time<'2017-03-20';
| Hash Join  (cost=7310.80..14680.86 rows=14 width=1436) (actual 
time=33.053..73.180 rows=7869 loops=1)
|   Hash Cond: ((a.start_time = b.start_time) AND (a.site_id = b.site_id))
|   ->  Append  (cost=0.00..7192.56 rows=7883 width=723) (actual 
time=1.394..19.414 rows=7869 loops=1)
| ->  Seq Scan on eric_enodeb_metrics a  (cost=0.00..0.00 rows=1 
width=718) (actual time=0.003..0.003 rows=0 loops=1)
|   Filter: ((start_time >= '2017-03-19 00:00:00-04'::timestamp 
with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time 
zone))
| ->  Bitmap Heap Scan on eric_enodeb_201703 a_1  (cost=605.34..7192.56 
rows=7882 width=723) (actual time=1.390..14.536 rows=7869 loops=1)
|   Recheck Cond: ((start_time >= '2017-03-19 
00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 
00:00:00-04'::timestamp with time zone))
|   Heap Blocks: exact=247
|   ->  Bitmap Index Scan on eric_enodeb_201703_unique_idx  
(cost=0.00..603.37 rows=7882 width=0) (actual time=1.351..1.351 rows=7869 
loops=1)
| Index Cond: ((start_time >= '2017-03-19 
00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 
00:00:00-04'::timestamp with time zone))
|   ->  Hash  (cost=7192.56..7192.56 rows=7883 width=723) (actual 
time=31.620..31.620 rows=7869 loops=1)
| Buckets: 8192  Batches: 1  Memory Usage: 1986kB
| ->  Append  (cost=0.00..7192.56 rows=7883 width=723) (actual 
time=0.902..19.543 rows=7869 loops=1)
|   ->  Seq Scan on eric_enodeb_metrics b  (cost=0.00..0.00 rows=1 
width=718) (actual time=0.002..0.002 rows=0 loops=1)
| Filter: ((start_time >= '2017-03-19 
00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 
00:00:00-04'::timestamp with time zone))
|   ->  Bitmap Heap Scan on eric_enodeb_201703 b_1  
(cost=605.34..7192.56 rows=7882 width=723) (actual time=0.899..14.353 rows=7869 
loops=1)
| Recheck Cond: ((start_time >= '2017-03-19 
00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 
00:00:00-04'::timestamp with time zone))
| Heap Blocks: exact=247
| ->  Bitmap Index Scan on eric_enodeb_201703_unique_idx  
(cost=0.00..603.37 rows=7882 width=0) (actual time=0.867..0.867 rows=7869 
loops=1)
|   Index Cond: ((start_time >= '2017-03-19 
00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 
00:00:00-04'::timestamp with time zone))



-- Specify WHERE once gets good estimate, but with unnecessary scan of all 
child partitions:
|ts=# explain analyze SELECT * FROM eric_enodeb_metrics a JOIN 
eric_enodeb_metrics b USING (start_time, site_id) WHERE 
start_time>='2017-03-19' AND start_time<'2017-03-20';
| Gather  (cost=8310.80..316545.60 rows=9591 width=1427) (actual 
time=9012.967..9073.539 rows=7869 loops=1)
|   Workers Planned: 3
|   Workers Launched: 3
|   ->  Hash Join  (cost=7310.80..314586.50 rows=3094 width=1427) (actual 
time=8892.121..8937.245 rows=1967 loops=4)
| Hash Cond: ((b.start_time = a.start_time) AND (b.site_id = a.site_id))
| ->  Append  (cost=0.00..261886.54 rows=2015655 width=714) (actual 
time=11.464..8214.063 rows=1308903 loops=4)
|   ->  Parallel Seq Scan on eric_enodeb_metrics b  
(cost=0.00..0.00 rows=1 width=718) (actual time=0.001..0.001 rows=0 loops=4)
|   ->  Parallel Seq Scan on eric_enodeb_201510 b_1  
(cost=0.00..10954.43 rows=60343 width=707) (actual time=11.460..258.852 
rows=46766 loops=4)
|   ->  Parallel Seq Scan on eric_enodeb_201511 b_2  
(cost=0.00..10310.91 rows=56891 width=707) (actual time=18.395..237.841 
rows=44091 loops=4)
|[...]
|   ->  Parallel Seq Scan on eric_enodeb_201703 b_29  
(cost=0.00..6959.75 rows=81875 width=723) (actual time=0.017..101.969 
rows=49127 loops=4)
| ->  Hash  (cost=7192.56..7192.56 rows=7883 width=723) (actual 
time=51.843..51.843 rows=7869 

[PERFORM] self join estimate and constraint exclusion

2017-03-26 Thread Justin Pryzby
It seems when self (inner/equi) joining there are two bad alternatives: either
specify a where clause for each self-joined table and incur poor estimate and
plan, due to incorrect perceived independence of clauses, even though joined
column ought to be known equal; or, specify where clause only once, and incur
cost of joining across all partitions, due to no constraint exclusion on (at
least) one self-joined table hierarchy.

-- Specify WHERE for each table causes bad underestimate:
|ts=# explain analyze SELECT * FROM eric_enodeb_metrics a JOIN 
eric_enodeb_metrics b USING (start_time, site_id) WHERE 
a.start_time>='2017-03-19' AND a.start_time<'2017-03-20' AND 
b.start_time>='2017-03-19' AND b.start_time<'2017-03-20';
| Hash Join  (cost=7310.80..14680.86 rows=14 width=1436) (actual 
time=33.053..73.180 rows=7869 loops=1)
|   Hash Cond: ((a.start_time = b.start_time) AND (a.site_id = b.site_id))
|   ->  Append  (cost=0.00..7192.56 rows=7883 width=723) (actual 
time=1.394..19.414 rows=7869 loops=1)
| ->  Seq Scan on eric_enodeb_metrics a  (cost=0.00..0.00 rows=1 
width=718) (actual time=0.003..0.003 rows=0 loops=1)
|   Filter: ((start_time >= '2017-03-19 00:00:00-04'::timestamp 
with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time 
zone))
| ->  Bitmap Heap Scan on eric_enodeb_201703 a_1  (cost=605.34..7192.56 
rows=7882 width=723) (actual time=1.390..14.536 rows=7869 loops=1)
|   Recheck Cond: ((start_time >= '2017-03-19 
00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 
00:00:00-04'::timestamp with time zone))
|   Heap Blocks: exact=247
|   ->  Bitmap Index Scan on eric_enodeb_201703_unique_idx  
(cost=0.00..603.37 rows=7882 width=0) (actual time=1.351..1.351 rows=7869 
loops=1)
| Index Cond: ((start_time >= '2017-03-19 
00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 
00:00:00-04'::timestamp with time zone))
|   ->  Hash  (cost=7192.56..7192.56 rows=7883 width=723) (actual 
time=31.620..31.620 rows=7869 loops=1)
| Buckets: 8192  Batches: 1  Memory Usage: 1986kB
| ->  Append  (cost=0.00..7192.56 rows=7883 width=723) (actual 
time=0.902..19.543 rows=7869 loops=1)
|   ->  Seq Scan on eric_enodeb_metrics b  (cost=0.00..0.00 rows=1 
width=718) (actual time=0.002..0.002 rows=0 loops=1)
| Filter: ((start_time >= '2017-03-19 
00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 
00:00:00-04'::timestamp with time zone))
|   ->  Bitmap Heap Scan on eric_enodeb_201703 b_1  
(cost=605.34..7192.56 rows=7882 width=723) (actual time=0.899..14.353 rows=7869 
loops=1)
| Recheck Cond: ((start_time >= '2017-03-19 
00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 
00:00:00-04'::timestamp with time zone))
| Heap Blocks: exact=247
| ->  Bitmap Index Scan on eric_enodeb_201703_unique_idx  
(cost=0.00..603.37 rows=7882 width=0) (actual time=0.867..0.867 rows=7869 
loops=1)
|   Index Cond: ((start_time >= '2017-03-19 
00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 
00:00:00-04'::timestamp with time zone))



-- Specify WHERE once gets good estimate, but with unnecessary scan of all 
child partitions:
|ts=# explain analyze SELECT * FROM eric_enodeb_metrics a JOIN 
eric_enodeb_metrics b USING (start_time, site_id) WHERE 
start_time>='2017-03-19' AND start_time<'2017-03-20';
| Gather  (cost=8310.80..316545.60 rows=9591 width=1427) (actual 
time=9012.967..9073.539 rows=7869 loops=1)
|   Workers Planned: 3
|   Workers Launched: 3
|   ->  Hash Join  (cost=7310.80..314586.50 rows=3094 width=1427) (actual 
time=8892.121..8937.245 rows=1967 loops=4)
| Hash Cond: ((b.start_time = a.start_time) AND (b.site_id = a.site_id))
| ->  Append  (cost=0.00..261886.54 rows=2015655 width=714) (actual 
time=11.464..8214.063 rows=1308903 loops=4)
|   ->  Parallel Seq Scan on eric_enodeb_metrics b  
(cost=0.00..0.00 rows=1 width=718) (actual time=0.001..0.001 rows=0 loops=4)
|   ->  Parallel Seq Scan on eric_enodeb_201510 b_1  
(cost=0.00..10954.43 rows=60343 width=707) (actual time=11.460..258.852 
rows=46766 loops=4)
|   ->  Parallel Seq Scan on eric_enodeb_201511 b_2  
(cost=0.00..10310.91 rows=56891 width=707) (actual time=18.395..237.841 
rows=44091 loops=4)
|[...]
|   ->  Parallel Seq Scan on eric_enodeb_201703 b_29  
(cost=0.00..6959.75 rows=81875 width=723) (actual time=0.017..101.969 
rows=49127 loops=4)
| ->  Hash  (cost=7192.56..7192.56 rows=7883 width=723) (actual 
time=51.843..51.843 rows=7869 loops=4)
|   Buckets: 8192  Batches: 1  Memory Usage: 1970kB
|   ->  Append  (cost=0.00..7192.56 rows=7883 width=723) (actual 
time=2.558..27.829 rows=7869 loops=4)
| ->  Seq Scan on 

Re: [PERFORM] Performance issue in PostgreSQL server...

2017-03-05 Thread Justin Pryzby
On Sun, Mar 05, 2017 at 08:23:08PM -0800, Jeff Janes wrote:
> On Fri, Mar 3, 2017 at 4:44 AM, Dinesh Chandra 12108 
>  wrote:
> > The below is the output for psql=> EXPLAIN ANALYZE SELECT DISTINCT
> > feature_id FROM evidence.point p INNER JOIN evidence.observation_evidence
> > oe ON p.feature_id = oe.evd_feature_id WHERE p.domain_class_id IN (11) AND
> > (p.modification_time > '2015-05-10 00:06:56.056 IST' OR
> > oe.modification_time > '2015-05-10 00:06:56.056 IST') ORDER BY feature_id
> ...
> 
> >  ->  Index Scan using point_domain_class_id_index on 
> > point p  (cost=0.00..1483472.70 rows=1454751 width=16) (actual 
> > time=27.265..142101.1 59 rows=1607491 loops=1)
> >Index Cond: (domain_class_id = 11)
> 
> Why wouldn't this be using a bitmap scan rather than a regular index scan?
> It seems like it should prefer the bitmap scan, unless the table is well
> clustered on domain_class_id.  In which case, why isn't it just faster?

Could you send:

SELECT * FROM pg_stats WHERE tablename='point' AND attname='domain_class_id' ;

.. or if that's too verbose or you don't want to share the histogram or MCV
list:

SELECT correlation FROM pg_stats WHERE tablename='point' AND 
attname='domain_class_id' ;

Justin




Re: [PERFORM] no MCV list of tiny table with unique columns

2016-11-05 Thread Justin Pryzby
On Wed, Nov 02, 2016 at 07:48:23PM -0400, Tom Lane wrote:

> There's not a lot of point in worrying about your two-row table when these
> other estimates are off by multiple orders of magnitude.  In this
> particular case my first bet would be that the planner has no idea about
> the selectivity of the conditions on "echo_tango('seven_november'::text,
> four_charlie)".  Reformulating that, or maybe making an index on it just
> so that ANALYZE will gather stats about it, could help.

Thanks, you're exactly right.  That's date_trunc('hour') BTW.

We actually already have a "new way" of doing that which avoids date_trunc, so
now I just have to get it in place for 100+ old reports..

I thought I had tried that before, but I think I was confusing myself, and
tried putting the index on the parent, which ends up with no stats since it's
empty.
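
A sketch of the expression-index approach on a child table (names hypothetical;
assuming a timestamp-without-time-zone column, where date_trunc is immutable),
after which ANALYZE gathers stats on the indexed expression:

ts=# CREATE INDEX ON some_child_table (date_trunc('hour', some_timestamp_col));
ts=# ANALYZE some_child_table;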

With indices+analyze:
 Sort  (cost=189014.28..189014.28 rows=1 width=785) (actual 
time=25063.831..25063.886 rows=328 loops=1)
 ...

BTW:
join_collapse_limit | 8
from_collapse_limit | 8

..and changing them doesn't seem to have any effect.  By my count there's 11
tables, not counting a few that are used multiple times..

Thanks again.

Justin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] no MCV list of tiny table with unique columns

2016-11-02 Thread Justin Pryzby
On Wed, Nov 02, 2016 at 04:05:46PM -0400, Tom Lane wrote:
> Justin Pryzby <pry...@telsasoft.com> writes:
> > I believe the join is being (badly) underestimated, leading to a crappy plan
> > involving multiple nested loop joins, which takes 2.5 hours instead of a
> > handful of seconds; I believe that might be resolved by populating its MCV
> > list..
> 
> With only two rows in the table, I'm not real sure why you'd need an MCV
> list.  Could we see the actual problem query (and the other table
> schemas), rather than diving into the code first?

Sigh, yes, but understand that it's a legacy report which happens to currently
be near the top of my list of things to improve:

https://explain.depesz.com/s/5rN6

The relevant table is involved three times:

Seq Scan on two_november mike_oscar (cost=0.000..1.020 rows=2 width=18) (actual 
time=0.010..0.010 rows=2 loops=1) 
Seq Scan on echo_oscar foxtrot (cost=0.000..209.860 rows=6,286 width=13) 
(actual time=0.014..2.271 rows=5,842 loops=1) 
Seq Scan on two_november xray_yankee_alpha (cost=0.000..1.020 rows=2 width=18) 
(actual time=0.017..0.019 rows=2 loops=1) 

Justin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] no MCV list of tiny table with unique columns

2016-11-02 Thread Justin Pryzby
We have a report query which joins (multiple times, actually) against this
trivial, tiny table:

ts=# \d bsm_to_switch
Table "public.bsm_to_switch"
 Column | Type | Modifiers 
--------+------+-----------
 bsm    | text | not null
 switch | text | not null

ts=# SELECT length(bsm), length(switch) FROM bsm_to_switch;
 length | length 
--------+--------
     10 |      6
     10 |      6
(2 rows)

The column values are distinct.

I believe the join is being (badly) underestimated, leading to a crappy plan
involving multiple nested loop joins, which takes 2.5 hours instead of a
handful of seconds; I believe that might be resolved by populating its MCV
list..

..however, on reading commands/analyze.c, the issue is these columns have no
duplicates, and also postgres decides that "since the number of distinct rows
is greater than 10% of the total number of rows", that ndistinct should be -1
(meaning it scales with the table size).  That's fine, except that it then
effectively precludes populating the MCV list.

|   if (nmultiple == 0)
|   {
|       /*
|        * If we found no repeated non-null values, assume it's a unique
|        * column; but be sure to discount for any nulls we found.
|        */
|       stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
|   }
|   else if (track_cnt < track_max && toowide_cnt == 0 &&
|            nmultiple == track_cnt)
|   {
|       /*
|        * Our track list includes every value in the sample, and every
|        * value appeared more than once.  Assume the column has just
|        * these values.  (This case is meant to address columns with
|        * small, fixed sets of possible values, such as boolean or enum
|        * columns.  If there are any values that appear just once in the
|        * sample, including too-wide values, we should assume that that's
|        * not what we're dealing with.)
|        */
|       stats->stadistinct = track_cnt;
|   }

ts=# SELECT attname, inherited, null_frac, avg_width, n_distinct, 
most_common_vals FROM pg_stats WHERE tablename='bsm_to_switch';
 attname | inherited | null_frac | avg_width | n_distinct | most_common_vals 
---------+-----------+-----------+-----------+------------+------------------
 bsm     | f         |         0 |        11 |         -1 | 
 switch  | f         |         0 |         7 |         -1 | 
(2 rows)


Any ideas?  I tried setting n_distinct=2, but that seems to not have any effect
within ANALYZE itself.
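
By "setting n_distinct" I mean the per-column storage parameter, roughly as
below; it overrides the value stored by the next ANALYZE (as seen just after),
but apparently doesn't make ANALYZE collect an MCV list:

ALTER TABLE bsm_to_switch ALTER COLUMN bsm SET (n_distinct = 2);
ALTER TABLE bsm_to_switch ALTER COLUMN switch SET (n_distinct = 2);
ANALYZE bsm_to_switch;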

ts=# SELECT attname, inherited, null_frac, avg_width, n_distinct, 
most_common_vals FROM pg_stats WHERE tablename='bsm_to_switch';
 attname | inherited | null_frac | avg_width | n_distinct | most_common_vals 
---------+-----------+-----------+-----------+------------+------------------
 bsm     | f         |         0 |        11 |          2 | 
 switch  | f         |         0 |         7 |          2 | 
(2 rows)

Thanks in advance.

Justin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-08-13 Thread Justin Pryzby
Regarding this earlier thread:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.ga11...@telsasoft.com

On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pry...@telsasoft.com> wrote:
> Summary: Non-unique btree indices are returning CTIDs for rows with same
> value of indexed column not in logical order, imposing a high performance
> penalty.

I have to point out that by "logical" I clearly meant "physical", hopefully
nobody was too misled..

On Sun, Jun 05, 2016 at 12:28:47PM -0700, Jeff Janes wrote:
> On Sun, Jun 5, 2016 at 9:03 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > Claudio Freire <klaussfre...@gmail.com> writes:
> >> So correlated index scans look extra favourable vs bitmap index scans
> >> because bitmap heap scans consider random page costs sans correlation
> >> effects (even though correlation applies to bitmap heap scans as
> >> well).
> >
> > Really?  How?  The index ordering has nothing to do with the order in
> > which heap tuples will be visited.
> 
> It is not the order itself, but the density.
> 
> If the index is read in a range scan (as opposed to =ANY scan), and
> the index lead column is correlated with the table ordering, then the
> parts of the table that need to be visited will be much denser than if
> there were no correlation.  But Claudio is saying that this is not
> being accounted for.

I didn't completely understand Claudio/Jeff here, and not sure if we're on the
same page.  For queries on these tables, the index scan was very slow, due to a
fragmented index on a non-unique column, and a seq scan would have been (was)
faster (even if it means reading 70GB and filtering out 6 of 7 days' data).
That was resolved by adding a nightly reindex job (.. which sometimes competes
with other maintenance and has trouble running every table every night).

But I did find that someone else had previously reported this problem (in a
strikingly similar context and message, perhaps clearer than mine):
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520d6610.8040...@emulex.com

I also found this older thread:
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou...@4ax.com

There was mention of a TODO item:

* Compute index correlation on CREATE INDEX and ANALYZE, use it for index
  scan cost estimation

.. but perhaps I misunderstand and that's long since resolved ?

Justin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-06-03 Thread Justin Pryzby
On Fri, Jun 03, 2016 at 06:26:33PM -0300, Claudio Freire wrote:
> On Wed, May 25, 2016 at 11:00 AM, Justin Pryzby <pry...@telsasoft.com> wrote:
> >> > First, I found I was able to get 30-50min query results on full week's 
> >> > table by
> >> > prefering a seq scan to an index scan.  The row estimates seemed fine, 
> >> > and the
> >> > only condition is the timestamp, so the planner's use of index scan is as
> >> > expected.
> >>
> >> Can you show us the query?  I would expect a bitmap scan of the index
> >> (which would do what you want, but even more so), instead.
> > See explain, also showing additional tables/views being joined.  It's NOT 
> > doing
> > a bitmap scan though, and I'd be interested to find why; I'm sure that 
> > would've
> > improved this query enough so it never would've been an issue.
> > https://explain.depesz.com/s/s8KP
> >
> >  ->  Index Scan using 
> > cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on 
> > cdrs_huawei_pgwrecord_2016_05_01  (cost=0.56..1601734.57 rows=8943848 
> > width=349)
> >Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp 
> > without time zone) AND (recordopeningtime < '2016-05-08 
> > 00:00:00'::timestamp without time zone))
> 
> Please show your guc settings ( see
> https://wiki.postgresql.org/wiki/Server_Configuration )
> 
> A plan node like that, if it would result in I/O, with proper
> configuration should have selected a bitmap index/heap scan. If it
> didn't, it probably thinks it has more cache than it really does, and
> that would mean the wrong setting was set in effective_cache_size.

ts=# SELECT name, current_setting(name), SOURCE FROM pg_settings WHERE 
SOURCE='configuration file';
 dynamic_shared_memory_type  | posix   | configuration file
 effective_cache_size| 64GB| configuration file
 effective_io_concurrency| 8   | configuration file
 huge_pages  | try | configuration file
 log_autovacuum_min_duration | 0   | configuration file
 log_checkpoints | on  | configuration file
 maintenance_work_mem| 6GB | configuration file
 max_connections | 200 | configuration file
 max_wal_size| 4GB | configuration file
 min_wal_size| 6GB | configuration file
 shared_buffers  | 8GB | configuration file
 wal_compression | on  | configuration file
 work_mem| 1GB | configuration file

I changed at least maintenance_work_mem since I originally wrote, to try to
avoid tempfiles during REINDEX (though I'm not sure it matters, as the
tempfiles are effectively cached and may never actually be written).

It's entirely possible those settings aren't ideal.  The server has 72GB RAM.
There are usually very few (typically n<3 but at most a handful) nontrivial
queries running at once, if at all.

I wouldn't expect any data that's not recent (table data from the last 2 days, or
indexes from this month) to be cached, and wouldn't expect even that to be
entirely cached, either:

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE 
relname~'_2016_05_..$';
gb | 425.783050537109

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE 
relname~'_2016_05_...*idx';
gb | 60.0909423828125

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE 
relname~'_201605.*idx';
gb | 4.85528564453125

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE 
relname~'_201605$';
gb | 86.8688049316406
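
(So that's very roughly 425 + 60 + 87 + 5 ≈ 580GB of May tables and indexes,
versus 72GB of RAM.)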

As a test, I did SET effective_cache_size='1MB' before running explain, and it
still does:

|->  Index Scan using 
cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx on 
cdrs_huawei_pgwrecord_2016_05_29  (cost=0.44..1526689.49 rows=8342796 width=355)
|  Index Cond: ((recordopeningtime >= '2016-05-29 
00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 
00:00:00'::timestamp without time zone))

I set enable_indexscan=0, and got:

|->  Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_29  
(cost=168006.10..4087526.04 rows=8342796 width=355)
|  Recheck Cond: ((recordopeningtime >= '2016-05-29 
00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 
00:00:00'::timestamp without time zone))
|  ->  Bitmap Index Scan on 
cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx  (cost=0.00..165920.40 
rows=8342796 width=0)
|Index Cond: ((recordopeningtime >= '2016-05-29 
00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))

Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-05-25 Thread Justin Pryzby
On Tue, May 24, 2016 at 11:23:48PM -0700, Jeff Janes wrote:
> > But note the non-uniqueness of the index column:
> > ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE 
> > recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 
> > 1 ORDER BY 2 DESC;
> >   recordopeningtime  | count
> > -+---
> >  2016-05-21 12:17:29 |   176
> >  2016-05-21 12:17:25 |   171
> >  2016-05-21 13:11:33 |   170
> >  2016-05-21 10:20:02 |   169
> >  2016-05-21 11:30:02 |   167
> > [...]
> 
> That is not that much duplication.  You aren't going to have dozens or
> hundreds of leaf pages all with equal values.   (and you only showed
> the most highly duplicated ones, presumably the average is much less)

Point taken, but it's not that great of a range either:

ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE 
recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 
ORDER BY 2 LIMIT 19;
  recordopeningtime  | count 
---------------------+-------
 2016-05-21 03:10:05 |    44
 2016-05-21 03:55:05 |    44
 2016-05-21 04:55:05 |    45

ts=# SELECT count(distinct recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE 
recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22';
-[ RECORD 1 ]
count | 86400

ts=# SELECT count(recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE 
recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22';
-[ RECORD 1 ]--
count | 8892865
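
(That works out to an average of 8892865 / 86400 ≈ 103 rows per distinct
timestamp, in between the extremes quoted above.)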

> > We have an daily analytic query which processes the previous day's data.  
> > For
> > new child tables, with only 1 days data loaded, this runs in ~30min, and for
> > child tables with an entire week's worth of data loaded, takes several hours
> > (even though both queries process the same amount of data).
> 
> For an append only table, why would the first day of a new partition
> be any less fragmented than that same day would be a week from now?
> Are you sure it isn't just that your week-old data has all been aged
> out of the cache?
I don't think it's a cache effect, since we're not using the source table for
(maybe) anything else the entire rest of the day.  The server has 72GB RAM, the
same size as the largest of the tables being joined (beginning at 4am).

I didn't mean that a given day is more fragmented now than it was last week
(but I don't know, either).  I guess when we do a query on the table with ~32
hours of data in it, it might do a seq scan rather than an index scan, too.

Compare the end of month partition tables:
ts=# select * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_02_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# select * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_03_29_recordopeningtime_idx');
leaf_fragmentation | 48.38
ts=# select * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_04_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# SELECT * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_04_22_recordopeningtime_idx');
leaf_fragmentation | 48.66
ts=# SELECT * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_03_22_recordopeningtime_idx');
leaf_fragmentation | 48.27
ts=# SELECT * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_02_22_recordopeningtime_idx');
leaf_fragmentation | 48

This one I reindexed as a test:
ts=# SELECT * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx');
leaf_fragmentation | 0.01

.. and query ran in ~30min (reran a 2nd time, with cache effects: 25min).
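
For reference, the test rebuild was just the plain statement below; note that
REINDEX blocks writes to the table while it runs, so it has to fit into the
nightly maintenance window:

REINDEX INDEX cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx;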

> > First, I found I was able to get 30-50min query results on full week's 
> > table by
> > prefering a seq scan to an index scan.  The row estimates seemed fine, and 
> > the
> > only condition is the timestamp, so the planner's use of index scan is as
> > expected.
> 
> Can you show us the query?  I would expect a bitmap scan of the index
> (which would do what you want, but even more so), instead.
See explain, also showing additional tables/views being joined.  It's NOT doing
a bitmap scan though, and I'd be interested to find why; I'm sure that would've
improved this query enough so it never would've been an issue.
https://explain.depesz.com/s/s8KP

 ->  Index Scan using cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on 
cdrs_huawei_pgwrecord_2016_05_01  (cost=0.56..1601734.57 rows=8943848 width=349)
   Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp 
without time zone) AND (recordopeningtime < '2016-05-08 00:00:00'::timestamp 
without time zone))

> > AFAICT what's happening is that the index scan was returning pages
> > nonsequentially.  strace-ing the backend showed alternating lseek()s and
> > read()s, with the offsets not consistently increasing (nor consistently
> > decreasing):
..
> 
> Which of those are the table, and which the index?
Those weren't necessarily strace of the same process; I believe both of these
were table data/heap, and didn't include any index access.

> Something doesn't add up here.  How could an index of an append-only
> table possibly become that 

Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-05-25 Thread Justin Pryzby
On Tue, May 24, 2016 at 09:16:20PM -0700, Peter Geoghegan wrote:
> On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pry...@telsasoft.com> wrote:
> > Postgres seems to assume that the high degree of correlation of the table
> > column seen in pg_stats is how it will get data from the index scan, which
> > assumption seems to be very poor on what turns out to be a higly fragmented
> > index.  Is there a way to help it to understand otherwise??
> 
> Your complaint is vague. Are you complaining about the planner making
> a poor choice? I don't think that's the issue here, because you never
> made any firm statement about the planner making a choice that was
> worth than an alternative that it had available.

I was thinking there are a few possible places to make improvements: the planner
could have understood that scans of non-unique indices don't result in strictly
sequential scans of the table, the degree of non-sequentialness being
determined by the column statistics, and perhaps by properties of the index
itself.

Or the INSERT code or btree scan could improve on this, even if tuples aren't
fully ordered.

> If you're arguing for the idea that B-Trees should reliably keep
> tuples in order by a tie-break condition, that seems difficult to
> implement, and likely not worth it in practice.

I had the awful idea to change the index to use (recordopeningtime,ctid).
Maybe somebody will convince me otherwise, but it may actually work better than
trying to reindex this table daily by 4am.

Thanks,
Justin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] index fragmentation on insert-only table with non-unique column

2016-05-24 Thread Justin Pryzby
Summary: Non-unique btree indices are returning CTIDs for rows with same
value of indexed column not in logical order, imposing a high performance
penalty.

Running PG 9.5.3 now, we have time-based partitions of append-only tables
with data loaded from other sources.  The tables are partitioned by time, and
the timestamp column has a non-unique, not-null btree index.

The child tables are each ~75GB and expected to keep growing.  For a child
table with a week's worth of data:
relpages  | 11255802
reltuples | 5.90502e+07

The data is loaded shortly after it's available, so it has high correlation in
pg_statistic:
[pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct FROM 
pg_stats s JOIN pg_class c ON (c.relname=s.tablename) WHERE tablename LIKE 
'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" |head
            tablename             | correlation | n_distinct 
----------------------------------+-------------+------------
 cdrs_huawei_pgwrecord            |        0.97 |     102892
 cdrs_huawei_pgwrecord_2016_02_15 |    0.999658 |      96145
 cdrs_huawei_pgwrecord_2016_02_22 |        0.43 |      91916
 cdrs_huawei_pgwrecord_2016_02_29 |    0.997219 |      50341
 cdrs_huawei_pgwrecord_2016_03_01 |        0.47 |      97485

But note the non-uniqueness of the index column:
ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE 
recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 
ORDER BY 2 DESC;
  recordopeningtime  | count
---------------------+-------
 2016-05-21 12:17:29 |   176
 2016-05-21 12:17:25 |   171
 2016-05-21 13:11:33 |   170
 2016-05-21 10:20:02 |   169
 2016-05-21 11:30:02 |   167
[...]

We have a daily analytic query which processes the previous day's data.  For
new child tables, with only 1 day's data loaded, this runs in ~30min, and for
child tables with an entire week's worth of data loaded, it takes several hours
(even though both queries process the same amount of data).

First, I found I was able to get 30-50min query results on the full week's table
by preferring a seq scan to an index scan.  The row estimates seemed fine, and the
only condition is the timestamp, so the planner's use of index scan is as
expected.
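
The query is essentially of this shape (heavily simplified; the real one joins
several other tables, per the explain.depesz.com link quoted elsewhere in this
thread):

SELECT ... FROM cdrs_huawei_pgwrecord
WHERE recordopeningtime >= '2016-05-07' AND recordopeningtime < '2016-05-08';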

AFAICT what's happening is that the index scan was returning pages
nonsequentially.  strace-ing the backend showed alternating lseek()s and
read()s, with the offsets not consistently increasing (nor consistently
decreasing):
% sudo strace -p 25588 2>&1 |grep -m9 'lseek(773'
lseek(773, 1059766272, SEEK_SET)= 1059766272
lseek(773, 824926208, SEEK_SET) = 824926208
lseek(773, 990027776, SEEK_SET) = 990027776
lseek(773, 990330880, SEEK_SET) = 990330880
lseek(773, 1038942208, SEEK_SET)= 1038942208
lseek(773, 1059856384, SEEK_SET)= 1059856384
lseek(773, 977305600, SEEK_SET) = 977305600
lseek(773, 990347264, SEEK_SET) = 990347264
lseek(773, 871096320, SEEK_SET) = 871096320

.. and consecutive read()s being rare:
read(802, "g"..., 8192) = 8192
lseek(802, 918003712, SEEK_SET) = 918003712
read(802, "c"..., 8192) = 8192
lseek(802, 859136000, SEEK_SET) = 859136000
read(802, "a"..., 8192) = 8192
lseek(802, 919601152, SEEK_SET) = 919601152
read(802, "d"..., 8192) = 8192
lseek(802, 905101312, SEEK_SET) = 905101312
read(802, "c"..., 8192) = 8192
lseek(801, 507863040, SEEK_SET) = 507863040
read(801, "p"..., 8192) = 8192
lseek(802, 914235392, SEEK_SET) = 914235392
read(802, "c"..., 8192) = 8192

I was able to see great improvement, without planner parameters, by REINDEXing
the timestamp index.  My theory is that the index/planner doesn't handle well the
case of many tuples with the same column value, and returns pages out of logical
order.  Reindex fixes that, rewriting the index data with pages in order
(confirmed with pageinspect), which causes index scans to fetch heap data more
or less monotonically (if not consecutively).  strace shows that consecutive
read()s are common (without intervening seeks).  I gather this allows the OS
readahead to kick in.
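
The pageinspect check was along these lines (block 1 here is just an example
leaf page); after REINDEX, the heap block numbers in the returned ctids increase
more or less monotonically across each leaf page:

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT itemoffset, ctid
FROM bt_page_items('cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx', 1)
LIMIT 10;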

Postgres seems to assume that the high degree of correlation of the table
column seen in pg_stats is how it will get data from the index scan, which
assumption seems to be very poor on what turns out to be a highly fragmented
index.  Is there a way to help it to understand otherwise??

Maybe this is a well-understood problem/deficiency; but, is there any reason
why a btree scan can't sort results by ctid for index tuples with the same column
value (_bt_steppage() or btgettuple())?  Or maybe the problem could be
mitigated by changing the behavior during INSERT?  In the meantime, I'll be
implementing a reindex job.
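
Concretely, for each recent child index the job will do something like the
following (a sketch, not the final thing), using pgstattuple's pgstatindex() to
decide when a rebuild is worthwhile; the index name is just an example following
the naming above:

SELECT leaf_fragmentation
FROM pgstatindex('cdrs_huawei_pgwrecord_2016_05_21_recordopeningtime_idx');
-- and, if leaf_fragmentation came back high:
REINDEX INDEX cdrs_huawei_pgwrecord_2016_05_21_recordopeningtime_idx;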

Thanks,
Justin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Clarification on using pg_upgrade

2016-03-04 Thread Justin Pryzby
On Fri, Mar 04, 2016 at 02:27:59PM -0800, Tory M Blue wrote:
> If my data is located in /data
> 
> and I link to a new dir in /data1,  what actually happens. do I end up with
> 2 file systems and links and thus am not able to delete or cleanup any old
> data, or how does this work?
> 
> Also will the reindex creation still happen with this type of in-place
> upgrade, as if so, then it may not save too much time vs a dump/import.

Since you have the space, you can do a test upgrade; make a dump of the
essential tables (or the entire thing) and restore it to another instance,
perhaps even something run from your /home.

pg_upgrade --link makes hardlinks for tables and indices (same as cp -l), so it
uses very little additional space.  Note that this means both clusters must be
within the same filesystem (/data).  You should understand hardlinks and inodes,
otherwise this will lead to confusion and mistakes.

Indexes don't need to be rebuilt afterwards.  I've upgraded ~35 customers to
9.5 already, some as big as 5TB.  So far the disruption has been at most 30min
(not counting ANALYZE afterwards).

When I use pg_upgrade, after stopping the old instance, I rename the data dir
(under centos, /var/lib/pgsql/9.4~).  Then pg_upgrade makes links in 9.5/.
Renaming has the advantage that the old instance can't be accidentally
started; and it makes it much easier to believe that it's safe to remove the 9.4~
afterwards.

Justin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] dump time increase by 1h with new kernel

2009-10-03 Thread Justin Pryzby
[I got no response on -general for a few days so I'm trying here]

When we upgraded from linux-2.6.24 to linux-2.6.27, our pg_dump
duration increased by 20% from 5 hours to 6.  My first attempt at
resolution was to boot with elevator=deadline.  However that's
actually the default IO scheduler in both kernels.

The two dmesg's are at:
https://www.norchemlab.com/tmp/linux-2.6.24-22.45-server
https://www.norchemlab.com/tmp/linux-2.6.27-14.41-server

The database partition is: xfs / lvm / aic79xx / scsi.

Booting back into the .24 kernel brings the pg_dump back down to 5
hours (for daily 20GB output compressed by pg_dump -Fc).

Does anyone know what might be different which could cause such a
drastic change?

Thanks,
Justin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

