[PERFORM] CREATE STATISTICS and join selectivity
I just noticed that PG10 CREATE STATISTICS (dependencies) doesn't seem to work
for joins on multiple columns; is that right?

With statistics on the table for 2017 but not 20171110:

ts=# CREATE STATISTICS x ON site_id, sect_id FROM eric_enodeb_cell_2017;
ts=# ANALYZE VERBOSE eric_enodeb_cell_2017;

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171110 a JOIN eric_enodeb_cell_20171110 b USING(start_time, sect_id) WHERE a.site_id=318 AND sect_id=1489;
 Nested Loop  (cost=0.83..4565.09 rows=1 width=0) (actual time=23.595..69.541 rows=96 loops=1)
=> bad estimate on redundant WHERE WITHOUT multivar statistics

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_2017 a JOIN eric_enodeb_cell_2017 b USING(start_time, sect_id) WHERE a.site_id=318 AND sect_id=1489;
 Nested Loop  (cost=0.83..4862.41 rows=96 width=0) (actual time=0.034..3.882 rows=96 loops=1)
=> good estimate on redundant WHERE WITH multivar statistics

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171110 a JOIN eric_enodeb_cell_20171110 b USING(start_time, sect_id);
 Merge Join  (cost=18249.85..19624.18 rows=54858 width=0) (actual time=157.252..236.945 rows=55050 loops=1)
=> good estimate on JOIN on SECT_id without stats

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171110 a JOIN eric_enodeb_cell_20171110 b USING(start_time, site_id);
 Merge Join  (cost=0.83..14431.81 rows=261499 width=0) (actual time=0.031..259.382 rows=262638 loops=1)
=> good estimate on JOIN on SITE_id without stats

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_2017 a JOIN eric_enodeb_cell_2017 b USING(start_time, site_id);
 Merge Join  (cost=0.83..14706.29 rows=268057 width=0) (actual time=37.360..331.276 rows=268092 loops=1)
=> good estimate on JOIN on SITE_id with stats

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_2017 a JOIN eric_enodeb_cell_2017 b USING(start_time, sect_id);
 Merge Join  (cost=18560.89..19959.67 rows=55944 width=0) (actual time=130.865..198.439 rows=55956 loops=1)
=> good estimate on JOIN on SECT_id with stats

ts=# 
explain ANALYZE SELECT FROM eric_enodeb_cell_2017 a JOIN eric_enodeb_cell_2017 b USING(start_time, sect_id, site_id);
 Gather  (cost=1000.83..1.06 rows=460 width=0) (actual time=1.686..149.707 rows=55956 loops=1)
=> poor estimate on redundant JOIN WITH stats (??)

I've already fixed our reports to avoid this kind of thing and support our PG95
customers, but I tentatively would've expected PG10 MV stats to "know" that
USING(site_id, sect_id) is no more selective than USING(sect_id), the same as
it knows that's true for WHERE site... AND sect...

Justin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
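[Editorial sketch] The WHERE-clause correction the multivariate statistics apply can be mimicked with back-of-envelope arithmetic. Assuming a functional dependency sect_id -> site_id with degree f, PG10's dependencies statistics estimate P(sect AND site) as roughly sel(sect) * (f + (1 - f) * sel(site)) rather than the naive product. The selectivities below are made-up illustration values, not taken from the tables above:

```python
def independent_sel(sel_a: float, sel_b: float) -> float:
    """Naive planner assumption: clauses are independent, so multiply."""
    return sel_a * sel_b

def dependency_sel(sel_det: float, sel_dep: float, degree: float) -> float:
    """Correction when 'det -> dep' holds with the given degree,
    following the formula used by PG10 functional-dependency stats."""
    return sel_det * (degree + (1.0 - degree) * sel_dep)

# hypothetical selectivities: sect_id=1489 is rarer than site_id=318
sel_sect, sel_site = 0.001, 0.02

naive = independent_sel(sel_sect, sel_site)      # ~2e-05: far too selective
fixed = dependency_sel(sel_sect, sel_site, 1.0)  # 0.001: same as sect alone
```

With a perfect dependency (degree 1.0) the redundant site_id clause contributes nothing, which is exactly the behavior seen in the WHERE cases above; the open question in this mail is why the same knowledge isn't applied to the join clauses.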
Re: [PERFORM] query performance issue
On Wed, Nov 15, 2017 at 03:03:39PM +0530, Samir Magar wrote:
> I am having performance issues with one of the query.
> The query is taking 39 min to fetch 3.5 mil records.
>
> I want to reduce that time to 15 mins.
> could you please suggest something to its performance?

> "HashAggregate  (cost=4459.68..4459.69 rows=1 width=27) (actual
> time=2890035.403..2892173.601 rows=3489861 loops=1)"

Looks to me like the problem is here:

> "  ->  Index Only Scan using idxdq7 on dlr_qlfy  (cost=0.43..4.45 ROWS=1
> width=16) (actual time=0.009..0.066 ROWS=121 loops=103987)"
> "      Index Cond: ((qlfy_grp_id = dlr_grp.dlr_grp_id) AND (qlf_flg =
> 'N'::bpchar) AND (cog_grp_id = dlr_grp_dlr_xref_1.dlr_grp_id))"
> "      Heap Fetches: 0"

That's returning 100x more rows than expected, and the error bubbles up through
a cascade of nested loops.

Are those 3 conditions independent?  Or, perhaps, are rows for which
"qlfy_grp_id = dlr_grp.dlr_grp_id" is true always going to have
"cog_grp_id = dlr_grp_dlr_xref_1.dlr_grp_id"?  Even if it's not "always" true,
if rows which pass the one condition are more likely to pass the other
condition, this will cause an underestimate, as observed.

You can do an experiment SELECTing just from those two tables joined, and see
if you can reproduce the problem with a poor rowcount estimate (hopefully in
much less than 15 min).

If you can't drop one of the two conditions, you can make PG treat them as a
single condition for the purpose of determining expected selectivity, using a
ROW() comparison like:

ROW(qlfy_grp_id, cog_grp_id) = ROW(dlr_grp.dlr_grp_id, dlr_grp_dlr_xref_1.dlr_grp_id)

If you're running PG96+ you may also be able to work around this by adding FKs.

Justin
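[Editorial sketch] The arithmetic behind the ROW() workaround suggested above, with purely illustrative numbers (not taken from this plan): when two equality conditions are correlated, multiplying their selectivities underestimates the rowcount, while a single row-comparison is charged as just one equality.

```python
n_rows = 1_000_000
sel1 = sel2 = 0.001      # each condition alone matches 0.1% of rows

# independence assumption: selectivities multiply
est_independent = n_rows * sel1 * sel2   # ~1 row expected

# if the second condition is implied by the first, the true count is:
actual = n_rows * sel1                   # ~1000 rows

# ROW(a, b) = ROW(x, y) is treated like one equality condition,
# avoiding the double-counting (per the suggestion in this mail):
est_row_compare = n_rows * sel1          # ~1000 rows
```

A 1000x underestimate at the bottom of a plan is what turns into "a cascade of nested loops" further up, since each parent node compounds the error.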
Re: [PERFORM] overestimate on empty table
On Fri, Nov 10, 2017 at 04:19:41PM -0500, Tom Lane wrote:
> Justin Pryzby <pry...@telsasoft.com> writes:
> > (or, the opposite of the more common problem)
> > As the queued_alters table is typically empty (and autoanalyzed with
> > relpages=0), I see "why":
>
> > ./src/backend/optimizer/util/plancat.c
> > |if (curpages < 10 &&
> > |    rel->rd_rel->relpages == 0 &&
> > |    !rel->rd_rel->relhassubclass &&
> > |    rel->rd_rel->relkind != RELKIND_INDEX)
> > |    curpages = 10;
>
> So I'm sure you read the comment above that, too.
> One idea is to say that relpages = reltuples = 0 is only the state that
> prevails for a freshly-created table, and that VACUUM or ANALYZE should
> always set relpages to at least 1 even if the physical size is zero.
> Dunno if that would confuse people.

What about adding && rel->rd_rel->reltuples==0, and making VACUUM/ANALYZE
instead set only reltuples=1?  Since that's already done at costsize.c:
clamp_row_est(), there would therefore be no additional confusion.

Justin
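[Editorial sketch] A Python rendering of the plancat.c clamp quoted above, to make the failure mode explicit (names are paraphrased; the real code operates on Relation catalog fields):

```python
def clamped_curpages(curpages: int, relpages: int,
                     has_subclass: bool, is_index: bool) -> int:
    # Mirrors the snippet from src/backend/optimizer/util/plancat.c:
    # a table whose catalog still shows relpages == 0 is assumed to
    # occupy 10 pages, so a genuinely empty (but analyzed) table with
    # relpages == 0 gets overestimated.
    if curpages < 10 and relpages == 0 and not has_subclass and not is_index:
        return 10
    return curpages

clamped_curpages(0, 0, False, False)  # empty, relpages=0: clamped to 10 pages
clamped_curpages(0, 1, False, False)  # relpages >= 1 would defeat the clamp
```

This is why Tom's suggestion (have VACUUM/ANALYZE record relpages of at least 1) or the thread's counter-proposal (also require reltuples == 0) would both stop the empty table from being costed as a 10-page relation.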
[PERFORM] overestimate on empty table
(or, the opposite of the more common problem)

I wrote this query some time ago to handle "deferred", table-rewriting,
type-promoting ALTERs of inheritance children, to avoid the worst-case disk
usage of altering the whole hierarchy at once, and also locking the entire
hierarchy against SELECT and INSERT.

ts=# explain analyze
SELECT child c, parent p, array_agg(colpar.attname::text) cols, array_agg(colpar.atttypid::regtype) AS types
FROM queued_alters qa
JOIN pg_attribute colpar ON qa.parent::regclass=colpar.attrelid
JOIN pg_attribute colcld ON qa.child::regclass=colcld.attrelid
WHERE colcld.attname=colpar.attname AND colpar.atttypid!=colcld.atttypid
GROUP BY 1,2
ORDER BY regexp_replace(child, '.*_((([0-9]{4}_[0-9]{2})_[0-9]{2})|(([0-9]{6})([0-9]{2})?))$', '\3\5') DESC, -- by MM
	child~'_[0-9]{6}$' DESC, -- monthly tables first
	regexp_replace(child, '.*_', '') DESC -- by MMDD
LIMIT 1;

Unfortunately we get this terrible plan:

 Limit  (cost=337497.59..337497.60 rows=1 width=184) (actual time=2395.283..2395.283 rows=0 loops=1)
   ->  Sort  (cost=337497.59..337500.04 rows=980 width=184) (actual time=2395.281..2395.281 rows=0 loops=1)
         Sort Key: (regexp_replace((qa.child)::text, '.*_((([0-9]{4}_[0-9]{2})_[0-9]{2})|(([0-9]{6})([0-9]{2})?))$'::text, '\3\5'::text)) DESC, (((qa.child)::text ~ '_[0-9]{6}$'::text)) DESC, (regexp_replace((qa.child)::text, '.*_'::text, ''::text)) DESC
         Sort Method: quicksort  Memory: 25kB
         ->  HashAggregate  (cost=337470.64..337492.69 rows=980 width=184) (actual time=2395.273..2395.273 rows=0 loops=1)
               Group Key: qa.child, qa.parent
               ->  Gather  (cost=293727.20..336790.89 rows=54380 width=123) (actual time=2395.261..2395.261 rows=0 loops=1)
                     Workers Planned: 3
                     Workers Launched: 3
                     ->  Hash Join  (cost=292727.20..330352.89 rows=17542 width=123) (actual time=2341.328..2341.328 rows=0 loops=4)
                           Hash Cond: ((((qa.child)::regclass)::oid = colcld.attrelid) AND (colpar.attname = colcld.attname))
                           Join Filter: (colpar.atttypid <> colcld.atttypid)
                           ->  Merge Join  (cost=144034.27..151009.09 rows=105280 width=123) (actual time=514.820..514.820 rows=0 loops=4)
                                 Merge Cond: (colpar.attrelid = (((qa.parent)::regclass)::oid))
                                 ->  Sort  (cost=143965.78..145676.59 rows=684322 width=72) (actual time=514.790..514.790 rows=1 loops=4)
                                       Sort Key: colpar.attrelid
                                       Sort Method: external merge  Disk: 78448kB
                                       ->  Parallel Seq Scan on pg_attribute colpar  (cost=0.00..77640.22 rows=684322 width=72) (actual time=0.011..164.106 rows=445582 loops=4)
                                 ->  Sort  (cost=68.49..70.94 rows=980 width=55) (actual time=0.031..0.031 rows=0 loops=3)
                                       Sort Key: (((qa.parent)::regclass)::oid)
                                       Sort Method: quicksort  Memory: 25kB
                                       ->  Seq Scan on queued_alters qa  (cost=0.00..19.80 rows=980 width=55) (actual time=0.018..0.018 rows=0 loops=3)
                           ->  Hash  (cost=92010.97..92010.97 rows=2121397 width=72) (actual time=1786.056..1786.056 rows=1782330 loops=4)
                                 Buckets: 2097152  Batches: 2  Memory Usage: 106870kB
                                 ->  Seq Scan on pg_attribute colcld  (cost=0.00..92010.97 rows=2121397 width=72) (actual time=0.027..731.554 rows=1782330 loops=4)

As the queued_alters table is typically empty (and autoanalyzed with
relpages=0), I see "why":

./src/backend/optimizer/util/plancat.c
|if (curpages < 10 &&
|    rel->rd_rel->relpages == 0 &&
|    !rel->rd_rel->relhassubclass &&
|    rel->rd_rel->relkind != RELKIND_INDEX)
|    curpages = 10;

Indeed it works much better if I add a child table as a test/kludge:

   ->  Sort  (cost=306322.49..306323.16 rows=271 width=403) (actual time=4.945..4.945 rows=0 loops=1)
         Sort Key: (regexp_replace((qa.child)::text, '.*_((([0-9]{4}_[0-9]{2})_[0-9]{2})|(([0-9]{6})([0-9]{2})?))$'::text, '\3\5'::text)) DESC, (((qa.child)::text ~ '_[0-9]{6}$'::text)) DESC, (regexp_replace((qa.child)::text, '.*_'::text, ''::text)) DESC
         Sort Method: quicksort  Memory: 25kB
         ->  GroupAggregate  (cost=306089.46..306321.13 rows=271 width=403) (actual time=4.938..4.938 rows=0 loops=1)
               Group Key: qa.child, qa.parent
               ->  Sort  (cost=306089.46..306127.06 rows=15038 width=342) (actual time=4.936..4.936 rows=0 loops=1)
                     Sort Key: qa.child, qa.parent
                     Sort Method: quicksort  Memory: 25kB
                     ->  Gather
Re: [PERFORM] Performance loss upgrading from 9.3 to 9.6
On 11/6/17, 9:21 AM, "Justin Pryzby" <pry...@telsasoft.com> wrote:
> see if statistics improve:
> SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV,
> tablename, attname, n_distinct, array_length(most_common_vals,1) n_mcv
> FROM pg_stats WHERE tablename~'customers_customer' AND attname='id' GROUP
> BY 1,2,3,4,5 ORDER BY 1

On Mon, Nov 06, 2017 at 09:12:01PM +0000, Adam Torres wrote:
> I changed the statistics on av.customer_id as suggested and the number
> returned by pg_stats went from 202,333 to 904,097.

Do you mean n_distinct?  It'd be useful to see that query on pg_stats.
Also, I don't think we've seen \d output for the tables (or at least the
joined columns), or the full query.

> There are 11.2 million distinct customer_ids on the 14.8 million vehicle
> records.

If there are so many distinct ids, updating stats won't help the rowcount
estimate (and could even hurt) - it can only store 1 most-common-values.
Are there as many distinct values for cc.id?

I would try to reproduce the rowcount problem with a minimal query:

explain analyze SELECT FROM av JOIN cc ON av.customer_id=cc.id; --WHERE cc.id<99;

Maybe the rows estimate is okay for some values and not for others, so maybe
you need to try various WHERE clauses (with JOINs to additional tables if need
be... but without reimplementing the whole query).

I just noticed there are two conditions on dealer_id, one from table av and
one from table cc_1.  It seems likely those are co-related/non-independent
conditions... but postgres probably doesn't know that (unless you used PG96 FK
logic, or PG10 multi-variable stats).  As a test, you could try dropping one
of those conditions, or maybe a hacky change like
ROW(av.dealer_id, cc_1.dealer_id)=ROW('EC79', 'EC79'), which postgres
estimates as no more selective than a single equality test.

BTW this is all from src/backend/utils/adt/selfuncs.c.

Justin
Re: [PERFORM] Performance loss upgrading from 9.3 to 9.6
On Mon, Nov 06, 2017 at 01:18:00PM +0000, Adam Torres wrote:
> Good morning all,
>
> We have a problem with performance after upgrading from 9.3 to 9.6 where
> certain queries take 9 times longer to run.  On our initial attempt to
> upgrade, we noticed the system as a whole was taking longer to run through
> normal daily processes.  The query with the largest run time was picked to
> act as a measuring stick.

> https://explain.depesz.com/s/z71u
> Planning time: 8.218 ms
> Execution time: 639319.525 ms
>
> Same query as run on 9.3
> https://explain.depesz.com/s/gjN3
> Total runtime: 272897.150 ms

Actually, it looks to me like both query plans are poor... because of this:

| Hash Join  (cost=85,086.25..170,080.80 ROWS=40 width=115) (actual time=32.673..84.427 ROWS=13,390 loops=1)
|     Hash Cond: (av.customer_id = cc_1.id)

If there are a large number of distinct customer_ids (maybe with nearly equal
frequencies), it might help to:

ALTER TABLE av ALTER customer_id SET STATISTICS 400;

...and the same for cc_1.id.  Then re-ANALYZE those tables (are they large??)
and see if statistics improve:

SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV,
tablename, attname, n_distinct, array_length(most_common_vals,1) n_mcv
FROM pg_stats WHERE tablename~'customers_customer' AND attname='id'
GROUP BY 1,2,3,4,5 ORDER BY 1;

The goal is to get at least an accurate value for n_distinct (but preferably
also storing the most frequent IDs).  I wouldn't bother re-running the query
unless you find that increasing the stats target causes the plan to change.

Justin
Re: [PERFORM] performance drop after upgrade (9.6 > 10)
On Tue, Oct 24, 2017 at 04:15:59PM +0200, johannes graën wrote:
> Hi Pavel, *,
>
> you were right with ANALYZing the DB first. However, even after doing
> so, I frequently see Seq Scans where an index was used before. This
> usually cooccurs with parallelization and looked different before
> upgrading to 10. I can provide an example for 10 [1], but I cannot
> generate a query plan for 9.6 anymore.
>
> Any ideas what makes the new version more seqscanny?

Is it because max_parallel_workers_per_gather now defaults to 2?

BTW, I would tentatively expect a change in default to be documented in the
release notes, but I can't see that it is.
77cd477c4ba885cfa1ba67beaa82e06f2e182b85

Justin
Re: [PERFORM] performance drop after upgrade (9.6 > 10)
On Tue, Oct 24, 2017 at 04:15:59PM +0200, johannes graën wrote:
> upgrading to 10. I can provide an example for 10 [1], but I cannot
> generate a query plan for 9.6 anymore.

You could (re)install PG96 alongside PG10 and run a copy of the DB (even from
your home directory, or on a different server), using pg_dump |pg_restore to
copy the relevant tables (just be sure to specify the alternate
host/port/user/etc as needed for the restore invocation).

Justin
Re: [PERFORM] Partitioned table - scans through every partitions
On Fri, Aug 25, 2017 at 03:36:29PM +0000, Aniko Belim wrote:
> Hi,
>
> We have an issue with one of our partitioned tables. It has a column with
> timestamp without time zone type, and we had to partition it daily. To do
> that, we created the following constraints like this example:
> CHECK (to_char(impression_time, 'YYYYMMDD'::text) = '20170202'::text)
>
> The problem we’re facing is no matter how we’re trying to select from it,
> it scans through every partition.
> Shouldn’t it only scan the
> dfp_in_network_impressions.dfp_in_network_impressions_20170202 child table?
> Or are we missing something?
> Any advice/help would be highly appreciated.

https://www.postgresql.org/docs/9.6/static/ddl-partitioning.html#DDL-PARTITIONING-CAVEATS
|The following caveats apply to constraint exclusion:
|Constraint exclusion only works when the query's WHERE clause contains
|constants (or externally supplied parameters). For example, a comparison
|against a non-immutable function such as CURRENT_TIMESTAMP cannot be
|optimized, since the planner cannot know which partition the function value
|might fall into at run time.
...
Re: [PERFORM] Performance Issue -- "Materialize"
On Sat, Aug 19, 2017 at 10:37:56AM -0700, anand086 wrote:
> QUERY PLAN
> Aggregate  (cost=351405.08..351405.09 rows=1 width=8)

Would you send explain ANALYZE, and not just explain?

Justin
Re: [PERFORM] Very poor read performance, query independent
On Tue, Jul 18, 2017 at 02:13:58PM -0300, Claudio Freire wrote:
> On Tue, Jul 18, 2017 at 1:01 PM, Claudio Freire wrote:
> > On Tue, Jul 18, 2017 at 6:20 AM, Charles Nadeau wrote:
> >> Claudio,
> >>
> >> At one moment during the query, there is a write storm to the swap drive
> >> (a bit like this case:
> >> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).
> >> I can hardly explain it as there is plenty of memory on this server.
> >
> > That sounds a lot like NUMA zone_reclaim issues:
> >
> > https://www.postgresql.org/message-id/500616cb.3070...@2ndquadrant.com
>
> I realize you have zone_reclaim_mode set to 0. Still, the symptoms are
> eerily similar.

Did you look at disabling KSM and/or THP?

sudo sh -c 'echo 2 >/sys/kernel/mm/ksm/run'

https://www.postgresql.org/message-id/20170524155855.GH31097%40telsasoft.com
https://www.postgresql.org/message-id/CANQNgOrD02f8mR3Y8Pi=zfsol14rqnqa8hwz1r4rsndlr1b...@mail.gmail.com
https://www.postgresql.org/message-id/CAHyXU0y9hviyKWvQZxX5UWfH9M2LYvwvAOPQ_DUPva2b71t12g%40mail.gmail.com
https://www.postgresql.org/message-id/20130716195834.8fe5c79249cb2ff0d4270...@yahoo.es
https://www.postgresql.org/message-id/CAE_gQfW3dBiELcOppYN6v%3D8%2B%2BpEeywD7iXGw-OT3doB8SXO4_A%40mail.gmail.com
https://www.postgresql.org/message-id/flat/1436268563235-5856914.post%40n5.nabble.com#1436268563235-5856914.p...@n5.nabble.com
https://www.postgresql.org/message-id/cal_0b1tjozcx3lo3eve1rqgat%2bjj_q7w4pkj87wfwwxbtug...@mail.gmail.com
https://www.postgresql.org/message-id/556e2068.7070...@vuole.me
https://www.postgresql.org/message-id/1415981309.90631.YahooMailNeo%40web133205.mail.ir2.yahoo.com
https://www.postgresql.org/message-id/CAHyXU0yXYpCXN4%3D81ZDRQu-oGzrcq2qNAXDpyz4oiQPPAGk4ew%40mail.gmail.com
https://www.pythian.com/blog/performance-tuning-hugepages-in-linux/
http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/

Justin
estimate correlation of index separately from table (Re: [PERFORM] index fragmentation on insert-only table with non-unique column)
Months ago I reported an issue with very slow index scans of tables with high
"correlation" of their indexed column, due to (we concluded at the time)
duplicate/repeated values of that column causing many lseek()s.  References:

https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.ga11...@telsasoft.com
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520d6610.8040...@emulex.com
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou...@4ax.com

My interim workarounds were a reindex job and daily granularity partitions
(despite having an excessive number of child tables), to encourage seq scans
for daily analysis jobs rather than index scans.

I've now cobbled together a patch to throw around and see if we can improve on
that.  I tried several changes, hoping to discourage index scans.  The logic
that seems to most accurately reflect costs has two changes:

Postgres' existing behavior stores table correlation (heap value vs. position)
but doesn't look at index correlation, so it can't distinguish between a
just-built index and a highly fragmented index, or one with highly
non-sequential TIDs.  My patch causes ANALYZE to do a TID scan (sampled across
the MCVs) to determine the correlation of heap TID vs. index tuple logical
location (as opposed to the table correlation, computed as heap TID vs. heap
value).

The second change averages separate correlation values of small lists of 300
consecutive TIDs, rather than the coarse-granularity/aggregate correlation of
a small sample of pages across the entire index.  Postgres' existing sampling
is designed to give an even sample across all rows.
An issue with this coarse-granularity correlation is that the index can have a
broad correlation (to physical heap location), but with many small-scale
deviations which don't show up, due to sampling a relatively small fraction of
a large table; and/or the effect of the deviations is insignificant/noise and
correlation is still computed near 1.

I believe the "large scale" correlation that postgres computes from the block
sample fails to represent small-scale uncorrelated reads, which contribute a
large number of random reads not included in the planner cost.  Not only are
the index reads highly random (which the planner already assumes), but the
CTIDs referenced within a given btree leaf page are also substantially
non-sequential.

It seems to affect INSERTs which, over a short interval of time, have a
small-to-moderate range of column values, but which still have a strong
overall trend in that column WRT time (and heap location).  Think of inserting
a "normal" distribution of timestamps centered around now().

My original report shows lseek() for nearly every read on the *heap*.  The
original problem was on a table of telecommunications CDRs, indexed by "record
opening time" (start time) and with a high correlation value in pg_stats.  We
import data records from a file, which is probably more or less in order of
"end time".

That still displays broad correlation on "start time", but individual TIDs are
nowhere near sequential.  Two phone calls which end in the same 1-second
interval are not unlikely to have started many minutes apart... but two calls
which end within the same second are very likely to have started within an
hour of each other, since the typical duration is <1h.
But insert volume is high, and there are substantial repeated keys, so the
portion of an index scan returning CTIDs for some certain key value (timestamp
with second resolution, in our case) ends up reading heap tuples for a
non-negligible fraction of the table: maybe only 1%, but that's 300 seeks
across 1% of a table which is 10s of GB... which is still 300 seeks, and takes
long enough that cache is inadequate to substantially mitigate the cost.

It's not clear to me that there's a good way to evenly *sample* a fraction of
the index blocks in a manner which is agnostic to different AMs.  Scanning the
entirety of all indices on relations during (auto)analyze may be too
expensive.  So I implemented an index scan of the MCV list.

I'm guessing this might cause the correlation to be under-estimated, and
prefer bitmap scans somewhat more than justified, due to the btree insertion
logic for repeated keys (intended to reduce O(n^2) behavior).  The MCV list
isn't perfect, since the same thing can happen with e.g. normally distributed
floating point values (or timestamps with fractional seconds).

I ran pageinspect on a recent child of the table that triggered the original
report:

ts=# SELECT itemoffset, ctid FROM bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx',6) LIMIT 22 OFFSET 1;
 itemoffset |  ctid
------------+--------
          2 | (81,4)
          3 | (5,6)
          4 | (6,5)
          5 | (9,1)
          6 | (10,1)
          7 | (14,1)
          8 | (21,8)
          9 | (25,1)
         10 | (30,5)
         11 |
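[Editorial sketch] The non-sequential CTIDs can be quantified with the same statistic the planner uses: Pearson correlation. Transcribing the heap page numbers from the bt_page_items() sample above and correlating them against index position shows how one out-of-order TID masks an otherwise strong local trend:

```python
import math

def pearson(xs, ys):
    # plain Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# heap page numbers from the bt_page_items() output above (itemoffsets 2..10)
pages = [81, 5, 6, 9, 10, 14, 21, 25, 30]
positions = list(range(len(pages)))

r_all = pearson(positions, pages)             # the stray (81,4) wrecks it
r_tail = pearson(positions[:-1], pages[1:])   # without it: strongly correlated
```

This illustrates the mail's point in miniature: a block sample over the whole index can still report correlation near 1 even though the reads issued within a single leaf page are substantially random.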
Re: [PERFORM] Re: join under-estimates with ineq conditions
I never heard back, but was hoping for some feedback/discussion about this 2nd
problem/patch.  Just a reminder - thanks.

On Thu, Jun 08, 2017 at 11:05:38AM -0500, Justin Pryzby wrote:
> On Mon, Jun 05, 2017 at 05:02:32PM -0400, Tom Lane wrote:
> > Justin Pryzby <pry...@telsasoft.com> writes:
> > > diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
> > > + if (nd1>vardata1->rel->rows) nd1=vardata1->rel->rows;
> > > + if (nd2>vardata2->rel->rows) nd2=vardata2->rel->rows;
> >
> > I don't like this change too much.
>
> Thanks for your analysis ;)
>
> I have a 2nd patch which improves the 2nd case I mentioned..
>
> > I note for instance that this patch would do nothing at all for the toy
>
> >> There's still a 2nd issue which this doesn't address, having to do with
> >> joins of tables with full/complete MCV lists, and selective queries on
> >> those tables, as demonstrated by the artificial test:
> >>
> >> > postgres=# CREATE TABLE t(i INT);
> >> > postgres=# TRUNCATE t;INSERT INTO t SELECT i FROM generate_series(1,99) i,generate_series(1,99);ANALYZE t;
> >> > postgres=# SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, attname, n_distinct, array_length(most_common_vals,1) n_mcv, array_length(histogram_bounds,1) n_hist, (SELECT MAX(x) FROM unnest(most_common_vals::text::text[]) x) maxmcv, (histogram_bounds::text::text[])[array_length(histogram_bounds,1)] maxhist FROM pg_stats WHERE attname~'i' AND tablename='t' GROUP BY 1,2,3,4,5,6,7,8 ORDER BY 1 DESC;
>
> I pointed out that there were two issues, both involving underestimates from
> querying a fraction of a table using an inequality condition.  One due to the
> join estimate based on "nd" (and not substantially based on MCV), and one due
> to frequencies associated with the MCV list (and not substantially falling
> back to the estimate from "nd").
>
> I made another patch to address the 2nd issue, which affects our
> pre-aggregated tables (which are partitioned by month, same as the raw
> tables).  The aggregated tables are the result of something like
> SELECT start_time::date, k1, k2, ..., sum(a), avg(b) ... GROUP BY 1,2,3,
> so they have many fewer rows, and nd for the start_time::date column would
> be at most 31, so the MCV list would be expected to be complete, same as
> the "toy" example I gave.
>
> Sometimes when we query the aggregated tables for a small number of days we
> get an underestimate leading to nested loops..
>
> Without patch:
> Merge Join  (cost=339.59..341.57 rows=99 width=4) (actual time=10.190..17.430 rows=9801 loops=1)
>
> With patch:
> DEBUG:  ndfactor 99.00 99.00
> DEBUG:  nmatches 99 matchprodfreq 1.00
> DEBUG:  nmatches 99 matchprodfreq 1.00
> DEBUG:  matchfreq1 99.00 unmatchfreq1 0.00
> DEBUG:  matchfreq1 1.00 unmatchfreq1 0.00
> DEBUG:  matchfreq2 99.00 unmatchfreq2 0.00
> DEBUG:  matchfreq2 1.00 unmatchfreq2 0.00
> DEBUG:  otherfreq1 0.00 otherfreq2 0.00
> DEBUG:  select(1) 1.00
> Hash Join  (cost=167.75..444.77 rows=9801 width=4) (actual time=4.706..13.892 rows=9801 loops=1)
>
> diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
> index 6a4f7b1..bc88423 100644
> --- a/src/backend/utils/adt/selfuncs.c
> +++ b/src/backend/utils/adt/selfuncs.c
> @@ -2279,6 +2279,14 @@ eqjoinsel_inner(Oid operator,
>
>  	nd1 = get_variable_numdistinct(vardata1, &isdefault1);
>  	nd2 = get_variable_numdistinct(vardata2, &isdefault2);
> +	float ndfactor1=1;
> +	float ndfactor2=1;
> +	if (vardata1->rel->rows)
> +		ndfactor1=vardata1->rel->tuples / vardata1->rel->rows;
> +	if (vardata2->rel->rows)
> +		ndfactor2=vardata2->rel->tuples / vardata2->rel->rows;
> +	// ndfactor1=ndfactor2=1;
> +	elog(DEBUG4, "ndfactor %lf %lf", ndfactor1,ndfactor2);
>
>  	opfuncoid = get_opcode(operator);
>
> @@ -2375,7 +2383,19 @@ eqjoinsel_inner(Oid operator,
>  			}
>  		}
>  	}
> +
> +	// you might think we should multiply by ndfactor1*ndfactor2,
> +	// but that gives serious overestimates...
> +	// matchprodfreq*= ndfactor1>ndfactor2?ndfactor1:ndfactor2;
> +	// matchprodfreq*=ndfactor1;
> +
Re: [PERFORM] Re: join under-estimates with ineq conditions
On Mon, Jun 05, 2017 at 05:02:32PM -0400, Tom Lane wrote: > Justin Pryzby <pry...@telsasoft.com> writes: > > diff --git a/src/backend/utils/adt/selfuncs.c > > b/src/backend/utils/adt/selfuncs.c > > + if (nd1>vardata1->rel->rows) nd1=vardata1->rel->rows; > > + if (nd2>vardata1->rel->rows) nd2=vardata2->rel->rows; > > I don't like this change too much. Thanks for your analysis ;) I have a 2nd patch which improves the 2nd case I mentioned.. > I note for instance that this patch would do nothing at all for the toy >> There's still an 2nd issue which this doesn't address, having to do with >> joins >> of tables with full/complete MCV lists, and selective queries on those >> tables, >> as demonstrated by the artificial test: >> >> > postgres=# CREATE TABLE t(i INT); >> > postgres=# TRUNCATE t;INSERT INTO t SELECT i FROM generate_series(1,99) >> > i,generate_series(1,99);ANALYZE t; >> > postgres=# SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) >> > frac_MCV, tablename, attname, n_distinct, array_length(most_common_vals,1) >> > n_mcv, array_length(histogram_bounds,1) n_hist, (SELECT MAX(x) FROM >> > unnest(most_common_vals::text::text[]) x) maxmcv, >> > (histogram_bounds::text::text[])[array_length(histogram_bounds,1)] maxhist >> > FROM pg_stats WHERE attname~'i' AND tablename='t' GROUP BY 1,2,3,4,5,6,7,8 >> > ORDER BY 1 DESC; I pointed out that there were two issues, both involving underestimates from querying a fraction of a table using inequality condition. One due to join estimate based on "nd" (and not substantially based on MCV), and one due to frequencies associated with MCV list (and not substantially falling back to estimate from "nd"). I made another patch to address the 2nd issue, which affects our pre-aggregated tables (which are partitioned by month, same as the raw tables). The aggregated tables are the result of something like SELECT start_time::date, k1, k2, ..., sum(a), avg(b) ... 
GROUP BY 1,2,3, so they have many fewer rows, and nd for the start_time::date
column would be at most 31, so the MCV list would be expected to be complete,
same as the "toy" example I gave.  Sometimes when we query the aggregated
tables for a small number of days we get an underestimate leading to nested
loops..

Without patch:
 Merge Join  (cost=339.59..341.57 rows=99 width=4) (actual time=10.190..17.430 rows=9801 loops=1)

With patch:
DEBUG:  ndfactor 99.00 99.00
DEBUG:  nmatches 99 matchprodfreq 1.00
DEBUG:  nmatches 99 matchprodfreq 1.00
DEBUG:  matchfreq1 99.00 unmatchfreq1 0.00
DEBUG:  matchfreq1 1.00 unmatchfreq1 0.00
DEBUG:  matchfreq2 99.00 unmatchfreq2 0.00
DEBUG:  matchfreq2 1.00 unmatchfreq2 0.00
DEBUG:  otherfreq1 0.00 otherfreq2 0.00
DEBUG:  select(1) 1.00
 Hash Join  (cost=167.75..444.77 rows=9801 width=4) (actual time=4.706..13.892 rows=9801 loops=1)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 6a4f7b1..bc88423 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2279,6 +2279,14 @@ eqjoinsel_inner(Oid operator,
 	nd1 = get_variable_numdistinct(vardata1, );
 	nd2 = get_variable_numdistinct(vardata2, );
 
+	float ndfactor1=1;
+	float ndfactor2=1;
+	if (vardata1->rel->rows)
+		ndfactor1=vardata1->rel->tuples / vardata1->rel->rows;
+	if (vardata2->rel->rows)
+		ndfactor2=vardata2->rel->tuples / vardata2->rel->rows;
+	// ndfactor1=ndfactor2=1;
+	elog(DEBUG4, "ndfactor %lf %lf", ndfactor1, ndfactor2);
 
 	opfuncoid = get_opcode(operator);
 
@@ -2375,7 +2383,19 @@ eqjoinsel_inner(Oid operator,
 			}
 		}
 	}
+
+	// you might think we should multiply by ndfactor1*ndfactor2,
+	// but that gives serious overestimates...
+	// matchprodfreq*= ndfactor1>ndfactor2?ndfactor1:ndfactor2;
+	// matchprodfreq*=ndfactor1;
+	// matchprodfreq*=ndfactor2;
+	// matchprodfreq*= ndfactor1<ndfactor2?ndfactor1:ndfactor2;
+	matchprodfreq*= ndfactor1<ndfactor2?ndfactor1:ndfactor2;
+
+	elog(DEBUG4, "nmatches %d matchprodfreq %lf", nmatches, matchprodfreq);
 	CLAMP_PROBABILITY(matchprodfreq);
+	elog(DEBUG4, "nmatches %d matchprodfreq %lf", nmatches, matchprodfreq);
+
 	/* Sum up frequencies of matched and unmatched MCVs */
 	matchfreq1 = unmatchfreq1 = 0.0;
 	for (i = 0; i < nvalues1; i++)
@@ -2385,8 +2405,14 @@ eqjoinsel_inner(Oid operator,
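To make the arithmetic of the patch concrete, here is a Python sketch (a simplification, not the planner's actual code path; the function and variable names are mine) of scaling the MCV match product by the smaller tuples/rows ratio, using the toy table's numbers: 99 MCVs of frequency 1/99 each, 9801 tuples, and 99 rows surviving the restriction.

```python
# Sketch (assumption: complete MCV lists on both join sides, so the
# match product covers every value) of the patch's idea: scale
# matchprodfreq by the smaller tuples/rows ratio of the two sides.
def patched_match_sel(freqs1, freqs2, tuples1, rows1, tuples2, rows2):
    ndfactor1 = tuples1 / rows1 if rows1 else 1.0
    ndfactor2 = tuples2 / rows2 if rows2 else 1.0
    matchprodfreq = sum(p * q for p, q in zip(freqs1, freqs2))
    matchprodfreq *= min(ndfactor1, ndfactor2)
    return min(matchprodfreq, 1.0)  # CLAMP_PROBABILITY

# toy table t: 99 MCVs of frequency 1/99; 9801 tuples, 99 rows pass i<2
freqs = [1 / 99] * 99
sel = patched_match_sel(freqs, freqs, 9801, 99, 9801, 99)
print(round(99 * 99 * sel))  # 9801, matching the patched Hash Join estimate
```

The unscaled match product is 99*(1/99)^2 = 1/99; multiplying by ndfactor = 9801/99 = 99 brings the selectivity to 1.0, which is why the DEBUG output above shows matchprodfreq going to 1.00.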
[PERFORM] Re: join estimate of subqueries with range conditions and constraint exclusion
On Wed, May 24, 2017 at 04:17:30PM -0500, Justin Pryzby wrote:
> We got bitten again by what appears to be the same issue I reported (perhaps
> poorly) here:
> https://www.postgresql.org/message-id/20170326193344.GS31628%40telsasoft.com

> I'm diagnosing a bad estimate/plan due to excessively high n_distinct leading
> to underestimated rowcount when selecting from a small fraction of the table
> hierarchy.  This leads intermittently to bad things, specifically a cascade of
> misestimates and associated nested loops around millions of rows.

I dug into this some more; I can mitigate the issue with this change:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 6a4f7b1..962a5b4 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2279,6 +2279,22 @@ eqjoinsel_inner(Oid operator,
 	nd1 = get_variable_numdistinct(vardata1, );
 	nd2 = get_variable_numdistinct(vardata2, );
 
+	elog(DEBUG4, "nd %lf %lf", nd1, nd2);
+	if (nd1>vardata1->rel->rows) nd1=vardata1->rel->rows;
+	if (nd2>vardata1->rel->rows) nd2=vardata2->rel->rows;
+
+	elog(DEBUG4, "nd %lf %lf", nd1, nd2);
+	elog(DEBUG4, "rows %lf %lf", vardata1->rel->rows, vardata2->rel->rows);
+	elog(DEBUG4, "tuples %lf %lf", vardata1->rel->tuples, vardata2->rel->tuples);

original estimate:

DEBUG:  nd 35206.00 35206.00
DEBUG:  nd 35206.00 35206.00
DEBUG:  rows 5031.00 5031.00
DEBUG:  tuples 5031.00 5031.00
                                  QUERY PLAN
 Hash Join  (cost=1294.56..2558.62 rows=723 width=750) (actual time=103.273..490.984 rows=50300 loops=1)
   Hash Cond: (eric_enodeb_metrics.start_time = eric_enodeb_metrics_1.start_time)

patched estimate/plan:

postgres=# explain ANALYZE SELECT * FROM (SELECT * FROM eric_enodeb_metrics WHERE start_time>'2017-04-25 18:00') x JOIN (SELECT * FROM eric_enodeb_metrics WHERE start_time>'2017-04-25 18:00') y USING (start_time);
DEBUG:  nd 35206.00 35206.00
DEBUG:  nd 5031.00 5031.00
DEBUG:  rows 5031.00 5031.00
DEBUG:  tuples 5031.00 5031.00
| Hash Join
(cost=1294.56..2602.14 rows=5075 width=750) (actual time=90.445..477.712 rows=50300 loops=1) | Hash Cond: (eric_enodeb_metrics.start_time = eric_enodeb_metrics_1.start_time) | -> Append (cost=0.00..1231.67 rows=5031 width=379) (actual time=16.424..46.899 rows=5030 loops=1) | -> Seq Scan on eric_enodeb_metrics (cost=0.00..0.00 rows=1 width=378) (actual time=0.012..0.012 rows=0 loops=1) | Filter: (start_time > '2017-04-25 18:00:00-05'::timestamp with time zone) | -> Seq Scan on eric_enodeb_201704 (cost=0.00..1231.67 rows=5030 width=379) (actual time=16.408..45.634 rows=5030 loops=1) | Filter: (start_time > '2017-04-25 18:00:00-05'::timestamp with time zone) | Rows Removed by Filter: 23744 | -> Hash (cost=1231.67..1231.67 rows=5031 width=379) (actual time=73.801..73.801 rows=5030 loops=1) | Buckets: 8192 Batches: 1 Memory Usage: 1283kB | -> Append (cost=0.00..1231.67 rows=5031 width=379) (actual time=14.607..47.395 rows=5030 loops=1) | -> Seq Scan on eric_enodeb_metrics eric_enodeb_metrics_1 (cost=0.00..0.00 rows=1 width=378) (actual time=0.009..0.009 rows=0 loops=1) | Filter: (start_time > '2017-04-25 18:00:00-05'::timestamp with time zone) | -> Seq Scan on eric_enodeb_201704 eric_enodeb_201704_1 (cost=0.00..1231.67 rows=5030 width=379) (actual time=14.594..46.091 rows=5030 loops=1) | Filter: (start_time > '2017-04-25 18:00:00-05'::timestamp with time zone) | Rows Removed by Filter: 23744 .. which gets additionally extreme with increasingly restrictive condition, as rows estimate diverges more from nd. 
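As a sanity check on those numbers, a Python sketch of eqjoinsel_inner's fallback selectivity 1/max(nd1, nd2) (a simplification that ignores the MCV-matching path entirely) reproduces roughly both the original and the patched estimates from the DEBUG output:

```python
# Sketch of the 1/max(nd1, nd2) join selectivity fallback, with and
# without clamping nd to the restricted relation size as in the patch.
def join_rows(nd1, nd2, rows1, rows2, clamp=False):
    if clamp:
        nd1 = min(nd1, rows1)
        nd2 = min(nd2, rows2)
    return rows1 * rows2 * (1.0 / max(nd1, nd2))

# numbers from the DEBUG output: nd=35206, restricted rows=5031
print(round(join_rows(35206, 35206, 5031, 5031)))              # 719, near the rows=723 estimate
print(round(join_rows(35206, 35206, 5031, 5031, clamp=True)))  # 5031, near the patched rows=5075
```

The small remaining differences (723 vs 719, 5075 vs 5031) come from planner details this sketch omits, such as the MCV contribution and null-fraction handling.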
There's still a 2nd issue which this doesn't address, having to do with joins
of tables with full/complete MCV lists, and selective queries on those tables,
as demonstrated by the artificial test:

> postgres=# CREATE TABLE t(i INT);
> postgres=# TRUNCATE t; INSERT INTO t SELECT i FROM generate_series(1,99) i, generate_series(1,99); ANALYZE t;
> postgres=# SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV,
> tablename, attname, n_distinct, array_length(most_common_vals,1) n_mcv,
> array_length(histogram_bounds,1) n_hist, (SELECT MAX(x) FROM
> unnest(most_common_vals::text::text[]) x) maxmcv,
> (histogram_bounds::text::text[])[array_length(histogram_bounds,1)] maxhist
> FROM pg_stats WHERE attname
[PERFORM] join estimate of subqueries with range conditions and constraint exclusion
We got bitten again by what appears to be the same issue I reported (perhaps
poorly) here:
https://www.postgresql.org/message-id/20170326193344.GS31628%40telsasoft.com

We have PG9.6.3 table hierarchies partitioned by time.  Our reports use
subqueries, each with its own copy of the range clauses on the time column, as
needed to get constraint exclusion; reference:
https://www.postgresql.org/message-id/25076.1366321335%40sss.pgh.pa.us

SELECT * FROM (SELECT * FROM t WHERE col>const) a JOIN (SELECT * FROM t WHERE col>const) b USING (col)

I'm diagnosing a bad estimate/plan due to excessively high n_distinct leading
to underestimated rowcount when selecting from a small fraction of the table
hierarchy.  This leads intermittently to bad things, specifically a cascade of
misestimates and associated nested loops around millions of rows.

Artificial/generated/contrived test case, involving a table with 99 instances
each of 99 values:

postgres=# CREATE TABLE t(i INT);
postgres=# TRUNCATE t; INSERT INTO t SELECT i FROM generate_series(1,99) i, generate_series(1,99); ANALYZE t;
postgres=# SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, attname, n_distinct, array_length(most_common_vals,1) n_mcv, array_length(histogram_bounds,1) n_hist, (SELECT MAX(x) FROM unnest(most_common_vals::text::text[]) x) maxmcv, (histogram_bounds::text::text[])[array_length(histogram_bounds,1)] maxhist FROM pg_stats WHERE attname~'i' AND tablename='t' GROUP BY 1,2,3,4,5,6,7,8 ORDER BY 1 DESC;
-[ RECORD 1 ]--
frac_mcv   | 1
tablename  | t
attname    | i
n_distinct | 99
n_mcv      | 99
n_hist     |
maxmcv     | 99
maxhist    |

range query (which could use constraint exclusion), but bad estimate:

postgres=# explain ANALYZE SELECT * FROM (SELECT * FROM t WHERE i<2) AS a JOIN (SELECT * FROM t WHERE i<2) AS b USING (i);
 Merge Join  (cost=339.59..341.57 rows=99 width=4) (actual time=8.272..16.892 rows=9801 loops=1)

range query which could NOT use constraint exclusion, good estimate:

postgres=# explain ANALYZE SELECT
* FROM (SELECT * FROM t) AS a JOIN (SELECT * FROM t) AS b USING (i) WHERE i<2;
 Hash Join  (cost=264.52..541.54 rows=9801 width=4) (actual time=12.688..22.325 rows=9801 loops=1)

non-range query, good estimate:

postgres=# explain ANALYZE SELECT * FROM (SELECT * FROM t WHERE i=3) AS a JOIN (SELECT * FROM t WHERE i=3) AS b USING (i);
 Nested Loop  (cost=0.00..455.78 rows=9801 width=4) (actual time=0.482..15.820 rows=9801 loops=1)

My understanding: Postgres estimates join selectivity using the number of
distinct values of the underlying table.  For the subqueries "a" and "b", the
estimate is the same as for the underlying table "t", even when selecting only
a small fraction of the table...  This is adt/selfuncs.c:eqjoinsel_inner().

Note, in my tests, report queries on the child table have correct estimates;
and queries with only a "pushed down" WHERE clause outside the subquery have a
correct estimate (but no constraint exclusion), apparently due to
calc_joinrel_size_estimate() returning the size of the parent table, planning a
join without the restriction clause, followed by filtering the join result, at
which point I guess the MCV list becomes useful and the estimate is perfect..

SELECT * FROM (SELECT * FROM t) a JOIN (SELECT * FROM t) b USING (col) WHERE col>const

So my original question is basically still open ... is it possible to get both
good estimates/plans AND constraint exclusion?

Thanks
Justin

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
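For what it's worth, the bad rows=99 estimate in the first plan above can be reproduced with a small Python sketch of the eqjoinsel_inner MCV arithmetic (a simplification; the real code also handles unmatched MCVs, histogram values, and nulls):

```python
# t's statistics give a complete MCV list: 99 values, frequency 1/99
# each.  eqjoinsel_inner joins the two MCV lists pairwise, unaware that
# each subquery's WHERE i<2 kept only the i=1 rows.
freqs = [1 / 99] * 99
match_sel = sum(p * q for p, q in zip(freqs, freqs))  # sum of p_i*q_i = 1/99
rows_a = rows_b = 99          # each subquery actually returns 99 rows
print(round(rows_a * rows_b * match_sel))  # 99, versus 9801 actual
```

The true selectivity after the restriction is 1.0 (every surviving row matches every other), so the estimate is off by the factor nd = 99, exactly the ratio between rows=99 and the actual 9801.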
Re: [PERFORM] Performance issue in PostgreSQL server...
On Mon, Mar 06, 2017 at 12:17:22PM +, Dinesh Chandra 12108 wrote:
> Below is the output of Query SELECT * FROM pg_stats WHERE tablename='point' AND attname='domain_class_id' ;
>
> schemaname | tablename | attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
> "evidence"|"point"|"domain_class_id"|f|0|8|10|"{7,9,2,11,43,3,1,10,4,17}"|"{0.9322,0.0451333,0.0145,0.0039,0.0018,0.00146667,0.0005,0.0003,6.7e-05,6.7e-05}"|""|0.889078

On Fri, Mar 03, 2017 at 12:44:07PM +, Dinesh Chandra 12108 wrote:
>    ->  Index Scan using point_domain_class_id_index on point p  (cost=0.00..1483472.70 rows=1454751 width=16) (actual time=27.265..142101.159 rows=1607491 loops=1)

On Sun, Mar 05, 2017 at 08:23:08PM -0800, Jeff Janes wrote:
> Why wouldn't this be using a bitmap scan rather than a regular index scan?
> It seems like it should prefer the bitmap scan, unless the table is well
> clustered on domain_class_id.  In which case, why isn't it just faster?

I missed your response until now, and can't see that anybody else responded,
but I suspect the issue is that the *table* is highly correlated WRT this
column, but the index may not be, probably due to duplicated index keys.
Postgres only stores statistics on expression indices, and otherwise falls
back to the correlation of the underlying table column for simple indices.

If you're still fighting this, would you send the result of:

SELECT domain_class_id, count(1) FROM point GROUP BY 1 ORDER BY 2 DESC LIMIT 22;
or,
SELECT count(1) FROM point GROUP BY domain_class_id ORDER BY 1 DESC LIMIT 22;

If there's much repetition in the index keys, then PG's planner thinks an
index scan has low random_page_cost (and effective_cache_size has little
effect on large tables), and it never uses a bitmap scan, which blows up if
the index is fragmented and has duplicate keys.
The table reads end up costing something like 1454751*random_page_cost in
nonsequential reads and fseek() calls, when it thinks it'll cost only
1454751*16*seq_page_cost.

Is the query much faster if you first REINDEX point_domain_class_id_index ?

This has come up before, see:
> https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520d6610.8040...@emulex.com
> https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.ga11...@telsasoft.com
> https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou...@4ax.com

Justin
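For rough intuition only (loose arithmetic under assumed defaults, not the planner's actual cost model; the tuples-per-page figure is hypothetical):

```python
# If the planner believes the index is correlated with the heap, heap
# fetches are charged close to sequential reads; a fragmented index over
# duplicate keys instead degrades toward one random read per tuple.
seq_page_cost, random_page_cost = 1.0, 4.0  # PostgreSQL defaults
tuples_per_page = 16                         # hypothetical row width
rows = 1454751                               # from the Index Scan estimate above
optimistic = rows / tuples_per_page * seq_page_cost
pessimistic = rows * random_page_cost
print(round(pessimistic / optimistic))  # 64: worst case vs. optimistic cost
```

The point is only the ratio: with these assumed numbers the real I/O can be tens of times what the correlated-scan costing predicts, which is the gap a REINDEX closes.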
[PERFORM] self join estimate and constraint exclusion
I mailed last month [0] but didn't see any response .. (if I'm being naive,
daft, or missing something simple, please just say so).

[0] https://www.postgresql.org/message-id/20170326193344.GS31628%40telsasoft.com

It seems when self (inner/equi) joining, there's two bad alternatives: either
specify a WHERE clause for each self-joined table and incur a poor estimate and
plan, due to incorrectly perceived independence of the clauses, even though the
joined column(s) could/ought to be known equal; or, specify the WHERE clause
only once, and incur the cost of joining across all partitions, due to no
constraint exclusion on one (or more) of the self-joined table hierarchies.

-- Specify WHERE for each table causes bad underestimate:
|ts=# explain analyze SELECT * FROM eric_enodeb_metrics a JOIN eric_enodeb_metrics b USING (start_time, site_id) WHERE a.start_time>='2017-03-19' AND a.start_time<'2017-03-20' AND b.start_time>='2017-03-19' AND b.start_time<'2017-03-20';
| Hash Join  (cost=7310.80..14680.86 rows=14 width=1436) (actual time=33.053..73.180 rows=7869 loops=1)
|   Hash Cond: ((a.start_time = b.start_time) AND (a.site_id = b.site_id))
|   ->  Append  (cost=0.00..7192.56 rows=7883 width=723) (actual time=1.394..19.414 rows=7869 loops=1)
|         ->  Seq Scan on eric_enodeb_metrics a  (cost=0.00..0.00 rows=1 width=718) (actual time=0.003..0.003 rows=0 loops=1)
|               Filter: ((start_time >= '2017-03-19 00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone))
|         ->  Bitmap Heap Scan on eric_enodeb_201703 a_1  (cost=605.34..7192.56 rows=7882 width=723) (actual time=1.390..14.536 rows=7869 loops=1)
|               Recheck Cond: ((start_time >= '2017-03-19 00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone))
|               Heap Blocks: exact=247
|               ->  Bitmap Index Scan on eric_enodeb_201703_unique_idx  (cost=0.00..603.37 rows=7882 width=0) (actual time=1.351..1.351 rows=7869 loops=1)
|                     Index Cond: ((start_time >= '2017-03-19
00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone)) | -> Hash (cost=7192.56..7192.56 rows=7883 width=723) (actual time=31.620..31.620 rows=7869 loops=1) | Buckets: 8192 Batches: 1 Memory Usage: 1986kB | -> Append (cost=0.00..7192.56 rows=7883 width=723) (actual time=0.902..19.543 rows=7869 loops=1) | -> Seq Scan on eric_enodeb_metrics b (cost=0.00..0.00 rows=1 width=718) (actual time=0.002..0.002 rows=0 loops=1) | Filter: ((start_time >= '2017-03-19 00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone)) | -> Bitmap Heap Scan on eric_enodeb_201703 b_1 (cost=605.34..7192.56 rows=7882 width=723) (actual time=0.899..14.353 rows=7869 loops=1) | Recheck Cond: ((start_time >= '2017-03-19 00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone)) | Heap Blocks: exact=247 | -> Bitmap Index Scan on eric_enodeb_201703_unique_idx (cost=0.00..603.37 rows=7882 width=0) (actual time=0.867..0.867 rows=7869 loops=1) | Index Cond: ((start_time >= '2017-03-19 00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone)) -- Specify WHERE once gets good estimate, but with unnecessary scan of all child partitions: |ts=# explain analyze SELECT * FROM eric_enodeb_metrics a JOIN eric_enodeb_metrics b USING (start_time, site_id) WHERE start_time>='2017-03-19' AND start_time<'2017-03-20'; | Gather (cost=8310.80..316545.60 rows=9591 width=1427) (actual time=9012.967..9073.539 rows=7869 loops=1) | Workers Planned: 3 | Workers Launched: 3 | -> Hash Join (cost=7310.80..314586.50 rows=3094 width=1427) (actual time=8892.121..8937.245 rows=1967 loops=4) | Hash Cond: ((b.start_time = a.start_time) AND (b.site_id = a.site_id)) | -> Append (cost=0.00..261886.54 rows=2015655 width=714) (actual time=11.464..8214.063 rows=1308903 loops=4) | -> Parallel Seq Scan on eric_enodeb_metrics b 
(cost=0.00..0.00 rows=1 width=718) (actual time=0.001..0.001 rows=0 loops=4) | -> Parallel Seq Scan on eric_enodeb_201510 b_1 (cost=0.00..10954.43 rows=60343 width=707) (actual time=11.460..258.852 rows=46766 loops=4) | -> Parallel Seq Scan on eric_enodeb_201511 b_2 (cost=0.00..10310.91 rows=56891 width=707) (actual time=18.395..237.841 rows=44091 loops=4) |[...] | -> Parallel Seq Scan on eric_enodeb_201703 b_29 (cost=0.00..6959.75 rows=81875 width=723) (actual time=0.017..101.969 rows=49127 loops=4) | -> Hash (cost=7192.56..7192.56 rows=7883 width=723) (actual time=51.843..51.843 rows=7869
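The scale of the rows=14 underestimate in the first plan is what you'd expect from the range clause being charged once per side; a toy Python illustration (rough round numbers, not the planner's exact formula):

```python
# Restricting both sides of a self equi-join with the same clause keeps
# roughly a fraction f of the matches, but treating the two copies of
# the clause as independent predicts about f*f of them.
n = 2_000_000  # rough hierarchy size, per the Append estimates above
f = 0.004      # rough one-day selectivity
true_rows = n * f              # each surviving row still finds its self-match
independent_rows = n * f * f   # the clause charged on both sides
print(round(true_rows), round(independent_rows))  # 8000 32
```

That f-versus-f-squared gap is the same order as the observed 7869 actual versus 14 estimated, which is why the single-WHERE form estimates well but loses constraint exclusion.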
[PERFORM] self join estimate and constraint exclusion
It seems when self (inner/equi) joining, there's two bad alternatives: either
specify a WHERE clause for each self-joined table and incur a poor estimate and
plan, due to incorrectly perceived independence of the clauses, even though the
joined column ought to be known equal; or, specify the WHERE clause only once,
and incur the cost of joining across all partitions, due to no constraint
exclusion on (at least) one self-joined table hierarchy.

-- Specify WHERE for each table causes bad underestimate:
|ts=# explain analyze SELECT * FROM eric_enodeb_metrics a JOIN eric_enodeb_metrics b USING (start_time, site_id) WHERE a.start_time>='2017-03-19' AND a.start_time<'2017-03-20' AND b.start_time>='2017-03-19' AND b.start_time<'2017-03-20';
| Hash Join  (cost=7310.80..14680.86 rows=14 width=1436) (actual time=33.053..73.180 rows=7869 loops=1)
|   Hash Cond: ((a.start_time = b.start_time) AND (a.site_id = b.site_id))
|   ->  Append  (cost=0.00..7192.56 rows=7883 width=723) (actual time=1.394..19.414 rows=7869 loops=1)
|         ->  Seq Scan on eric_enodeb_metrics a  (cost=0.00..0.00 rows=1 width=718) (actual time=0.003..0.003 rows=0 loops=1)
|               Filter: ((start_time >= '2017-03-19 00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone))
|         ->  Bitmap Heap Scan on eric_enodeb_201703 a_1  (cost=605.34..7192.56 rows=7882 width=723) (actual time=1.390..14.536 rows=7869 loops=1)
|               Recheck Cond: ((start_time >= '2017-03-19 00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone))
|               Heap Blocks: exact=247
|               ->  Bitmap Index Scan on eric_enodeb_201703_unique_idx  (cost=0.00..603.37 rows=7882 width=0) (actual time=1.351..1.351 rows=7869 loops=1)
|                     Index Cond: ((start_time >= '2017-03-19 00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone))
|   ->  Hash  (cost=7192.56..7192.56 rows=7883 width=723) (actual time=31.620..31.620 rows=7869 loops=1)
|         Buckets: 8192  Batches: 1
Memory Usage: 1986kB | -> Append (cost=0.00..7192.56 rows=7883 width=723) (actual time=0.902..19.543 rows=7869 loops=1) | -> Seq Scan on eric_enodeb_metrics b (cost=0.00..0.00 rows=1 width=718) (actual time=0.002..0.002 rows=0 loops=1) | Filter: ((start_time >= '2017-03-19 00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone)) | -> Bitmap Heap Scan on eric_enodeb_201703 b_1 (cost=605.34..7192.56 rows=7882 width=723) (actual time=0.899..14.353 rows=7869 loops=1) | Recheck Cond: ((start_time >= '2017-03-19 00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone)) | Heap Blocks: exact=247 | -> Bitmap Index Scan on eric_enodeb_201703_unique_idx (cost=0.00..603.37 rows=7882 width=0) (actual time=0.867..0.867 rows=7869 loops=1) | Index Cond: ((start_time >= '2017-03-19 00:00:00-04'::timestamp with time zone) AND (start_time < '2017-03-20 00:00:00-04'::timestamp with time zone)) -- Specify WHERE once gets good estimate, but with unnecessary scan of all child partitions: |ts=# explain analyze SELECT * FROM eric_enodeb_metrics a JOIN eric_enodeb_metrics b USING (start_time, site_id) WHERE start_time>='2017-03-19' AND start_time<'2017-03-20'; | Gather (cost=8310.80..316545.60 rows=9591 width=1427) (actual time=9012.967..9073.539 rows=7869 loops=1) | Workers Planned: 3 | Workers Launched: 3 | -> Hash Join (cost=7310.80..314586.50 rows=3094 width=1427) (actual time=8892.121..8937.245 rows=1967 loops=4) | Hash Cond: ((b.start_time = a.start_time) AND (b.site_id = a.site_id)) | -> Append (cost=0.00..261886.54 rows=2015655 width=714) (actual time=11.464..8214.063 rows=1308903 loops=4) | -> Parallel Seq Scan on eric_enodeb_metrics b (cost=0.00..0.00 rows=1 width=718) (actual time=0.001..0.001 rows=0 loops=4) | -> Parallel Seq Scan on eric_enodeb_201510 b_1 (cost=0.00..10954.43 rows=60343 width=707) (actual time=11.460..258.852 rows=46766 loops=4) | -> Parallel Seq Scan on 
eric_enodeb_201511 b_2 (cost=0.00..10310.91 rows=56891 width=707) (actual time=18.395..237.841 rows=44091 loops=4) |[...] | -> Parallel Seq Scan on eric_enodeb_201703 b_29 (cost=0.00..6959.75 rows=81875 width=723) (actual time=0.017..101.969 rows=49127 loops=4) | -> Hash (cost=7192.56..7192.56 rows=7883 width=723) (actual time=51.843..51.843 rows=7869 loops=4) | Buckets: 8192 Batches: 1 Memory Usage: 1970kB | -> Append (cost=0.00..7192.56 rows=7883 width=723) (actual time=2.558..27.829 rows=7869 loops=4) | -> Seq Scan on
Re: [PERFORM] Performance issue in PostgreSQL server...
On Sun, Mar 05, 2017 at 08:23:08PM -0800, Jeff Janes wrote:
> On Fri, Mar 3, 2017 at 4:44 AM, Dinesh Chandra 12108 wrote:
> > The below is the output for psql=> EXPLAIN ANALYZE SELECT DISTINCT
> > feature_id FROM evidence.point p INNER JOIN evidence.observation_evidence
> > oe ON p.feature_id = oe.evd_feature_id WHERE p.domain_class_id IN (11) AND
> > (p.modification_time > '2015-05-10 00:06:56.056 IST' OR
> > oe.modification_time > '2015-05-10 00:06:56.056 IST') ORDER BY feature_id
> ...
> >          ->  Index Scan using point_domain_class_id_index on
> > point p  (cost=0.00..1483472.70 rows=1454751 width=16) (actual
> > time=27.265..142101.159 rows=1607491 loops=1)
> >                Index Cond: (domain_class_id = 11)
>
> Why wouldn't this be using a bitmap scan rather than a regular index scan?
> It seems like it should prefer the bitmap scan, unless the table is well
> clustered on domain_class_id.  In which case, why isn't it just faster?

Could you send:

SELECT * FROM pg_stats WHERE tablename='point' AND attname='domain_class_id';

.. or if that's too verbose or you don't want to share the histogram or MCV list:

SELECT correlation FROM pg_stats WHERE tablename='point' AND attname='domain_class_id';

Justin
Re: [PERFORM] no MCV list of tiny table with unique columns
On Wed, Nov 02, 2016 at 07:48:23PM -0400, Tom Lane wrote:
> There's not a lot of point in worrying about your two-row table when these
> other estimates are off by multiple orders of magnitude.  In this
> particular case my first bet would be that the planner has no idea about
> the selectivity of the conditions on "echo_tango('seven_november'::text,
> four_charlie)".  Reformulating that, or maybe making an index on it just
> so that ANALYZE will gather stats about it, could help.

Thanks, you're exactly right.  That's date_trunc('hour') BTW.  We actually
already have a "new way" of doing that which avoids date_trunc, so now I just
have to get it in place for 100+ old reports..

I thought I had tried that before, but I think I was confusing myself, and
tried putting the index on the parent, which ends up with no stats since it's
empty.

With indices+analyze:
 Sort  (cost=189014.28..189014.28 rows=1 width=785) (actual time=25063.831..25063.886 rows=328 loops=1)
...

BTW:
 join_collapse_limit | 8
 from_collapse_limit | 8

..and changing them doesn't seem to have any effect.  By my count there's 11
tables, not counting a few used multiply..

Thanks again.

Justin
Re: [PERFORM] no MCV list of tiny table with unique columns
On Wed, Nov 02, 2016 at 04:05:46PM -0400, Tom Lane wrote:
> Justin Pryzby <pry...@telsasoft.com> writes:
> > I believe the join is being (badly) underestimated, leading to a crappy plan
> > involving multiple nested loop joins, which takes 2.5 hours instead of a
> > handful of seconds; I believe that might be resolved by populating its MCV
> > list..
>
> With only two rows in the table, I'm not real sure why you'd need an MCV
> list.  Could we see the actual problem query (and the other table
> schemas), rather than diving into the code first?

Sigh, yes, but understand that it's a legacy report which happens to currently
be near the top of my list of things to improve:
https://explain.depesz.com/s/5rN6

The relevant table is involved three times:

 Seq Scan on two_november mike_oscar  (cost=0.000..1.020 rows=2 width=18) (actual time=0.010..0.010 rows=2 loops=1)
 Seq Scan on echo_oscar foxtrot  (cost=0.000..209.860 rows=6,286 width=13) (actual time=0.014..2.271 rows=5,842 loops=1)
 Seq Scan on two_november xray_yankee_alpha  (cost=0.000..1.020 rows=2 width=18) (actual time=0.017..0.019 rows=2 loops=1)

Justin
[PERFORM] no MCV list of tiny table with unique columns
We have a report query which joins (multiple times, actually) against this
trivial, tiny table:

ts=# \d bsm_to_switch
Table "public.bsm_to_switch"
 Column | Type | Modifiers
--------+------+-----------
 bsm    | text | not null
 switch | text | not null

ts=# SELECT length(bsm), length(switch) FROM bsm_to_switch;
 length | length
--------+--------
     10 |      6
     10 |      6
(2 rows)

The column values are distinct.

I believe the join is being (badly) underestimated, leading to a crappy plan
involving multiple nested loop joins, which takes 2.5 hours instead of a
handful of seconds; I believe that might be resolved by populating its MCV
list..

..however, on reading commands/analyze.c, the issue is that these columns have
no duplicates, and postgres decides that "since the number of distinct rows is
greater than 10% of the total number of rows", ndistinct should be -1 (meaning
it scales with the table size).  That's fine, except that it then effectively
precludes populating the MCV list.

|	if (nmultiple == 0)
|	{
|		/*
|		 * If we found no repeated non-null values, assume it's a unique
|		 * column; but be sure to discount for any nulls we found.
|		 */
|		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
|	}
|	else if (track_cnt < track_max && toowide_cnt == 0 &&
|			 nmultiple == track_cnt)
|	{
|		/*
|		 * Our track list includes every value in the sample, and every
|		 * value appeared more than once.  Assume the column has just
|		 * these values.  (This case is meant to address columns with
|		 * small, fixed sets of possible values, such as boolean or enum
|		 * columns.  If there are any values that appear just once in the
|		 * sample, including too-wide values, we should assume that that's
|		 * not what we're dealing with.)
|		 */
|		stats->stadistinct = track_cnt;
|	}

ts=# SELECT attname, inherited, null_frac, avg_width, n_distinct, most_common_vals FROM pg_stats WHERE tablename='bsm_to_switch';
 attname | inherited | null_frac | avg_width | n_distinct | most_common_vals
---------+-----------+-----------+-----------+------------+------------------
 bsm     | f         |         0 |        11 |         -1 |
 switch  | f         |         0 |         7 |         -1 |
(2 rows)

Any ideas?

I tried setting n_distinct=2, but that seems to not have any effect within
ANALYZE itself.

ts=# SELECT attname, inherited, null_frac, avg_width, n_distinct, most_common_vals FROM pg_stats WHERE tablename='bsm_to_switch';
 attname | inherited | null_frac | avg_width | n_distinct | most_common_vals
---------+-----------+-----------+-----------+------------+------------------
 bsm     | f         |         0 |        11 |          2 |
 switch  | f         |         0 |         7 |          2 |
(2 rows)

Thanks in advance.

Justin
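A Python paraphrase of the two analyze.c branches quoted above (just the logic shown there; the other estimation paths are omitted) makes it plain why a sample with no duplicates can never reach the complete-MCV branch:

```python
def stadistinct_sketch(nmultiple, track_cnt, track_max, toowide_cnt, stanullfrac):
    # mirrors the two quoted branches of compute_scalar_stats()
    if nmultiple == 0:
        # no repeated non-null values: assume a unique column
        return -1.0 * (1.0 - stanullfrac)
    if track_cnt < track_max and toowide_cnt == 0 and nmultiple == track_cnt:
        # every sampled value repeated: assume the column has just these values
        return float(track_cnt)
    return None  # other estimation paths not sketched here

# the two-row bsm_to_switch columns: two distinct values, no duplicates
print(stadistinct_sketch(nmultiple=0, track_cnt=2, track_max=100,
                         toowide_cnt=0, stanullfrac=0.0))  # -1.0
```

With nmultiple == 0 the first branch fires unconditionally, so stadistinct becomes -1 and the MCV-friendly second branch is unreachable, which matches the n_distinct = -1 seen in pg_stats above.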
Re: [PERFORM] index fragmentation on insert-only table with non-unique column
Regarding this earlier thread:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.ga11...@telsasoft.com

On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pry...@telsasoft.com> wrote:
> Summary: Non-unique btree indices are returning CTIDs for rows with same
> value of indexed column not in logical order, imposing a high performance
> penalty.

I have to point out that by "logical" I clearly meant "physical"; hopefully
nobody was too misled..

On Sun, Jun 05, 2016 at 12:28:47PM -0700, Jeff Janes wrote:
> On Sun, Jun 5, 2016 at 9:03 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > Claudio Freire <klaussfre...@gmail.com> writes:
> >> So correlated index scans look extra favourable vs bitmap index scans
> >> because bitmap heap scans consider random page costs sans correlation
> >> effects (even though correlation applies to bitmap heap scans as
> >> well).
> >
> > Really?  How?  The index ordering has nothing to do with the order in
> > which heap tuples will be visited.
>
> It is not the order itself, but the density.
>
> If the index is read in a range scan (as opposed to =ANY scan), and
> the index lead column is correlated with the table ordering, then the
> parts of the table that need to be visited will be much denser than if
> there were no correlation.  But Claudio is saying that this is not
> being accounted for.

I didn't completely understand Claudio/Jeff here, and am not sure if we're on
the same page.  For queries on these tables, the index scan was very slow, due
to a fragmented index on a non-unique column, and a seq scan would have been
(was) faster (even if it means reading 70GB and filtering out 6 of 7 days'
data).  That was resolved by adding a nightly reindex job (.. which sometimes
competes with other maintenance and has trouble running on every table every
night).
But I did find that someone else had previously reported this problem (in a
strikingly similar context and message, perhaps clearer than mine):
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520d6610.8040...@emulex.com

I also found this older thread:
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou...@4ax.com

There was mention of a TODO item:

* Compute index correlation on CREATE INDEX and ANALYZE, use it for index
* scan cost estimation

.. but perhaps I misunderstand and that's long since resolved ?

Justin
Re: [PERFORM] index fragmentation on insert-only table with non-unique column
On Fri, Jun 03, 2016 at 06:26:33PM -0300, Claudio Freire wrote: > On Wed, May 25, 2016 at 11:00 AM, Justin Pryzby <pry...@telsasoft.com> wrote: > >> > First, I found I was able to get 30-50min query results on full week's > >> > table by > >> > prefering a seq scan to an index scan. The row estimates seemed fine, > >> > and the > >> > only condition is the timestamp, so the planner's use of index scan is as > >> > expected. > >> > >> Can you show us the query? I would expect a bitmap scan of the index > >> (which would do what you want, but even more so), instead. > > See explain, also showing additional tables/views being joined. It's NOT > > doing > > a bitmap scan though, and I'd be interested to find why; I'm sure that > > would've > > improved this query enough so it never would've been an issue. > > https://explain.depesz.com/s/s8KP > > > > -> Index Scan using > > cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on > > cdrs_huawei_pgwrecord_2016_05_01 (cost=0.56..1601734.57 rows=8943848 > > width=349) > >Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp > > without time zone) AND (recordopeningtime < '2016-05-08 > > 00:00:00'::timestamp without time zone)) > > Please show your guc settings ( see > https://wiki.postgresql.org/wiki/Server_Configuration ) > > A plan node like that, if it would result in I/O, with proper > configuration should have selected a bitmap index/heap scan. If it > didn't, it probably thinks it has more cache than it really does, and > that would mean the wrong setting was set in effective_cache_size. 
ts=# SELECT name, current_setting(name), source FROM pg_settings WHERE source='configuration file';
 dynamic_shared_memory_type  | posix | configuration file
 effective_cache_size        | 64GB  | configuration file
 effective_io_concurrency    | 8     | configuration file
 huge_pages                  | try   | configuration file
 log_autovacuum_min_duration | 0     | configuration file
 log_checkpoints             | on    | configuration file
 maintenance_work_mem        | 6GB   | configuration file
 max_connections             | 200   | configuration file
 max_wal_size                | 4GB   | configuration file
 min_wal_size                | 6GB   | configuration file
 shared_buffers              | 8GB   | configuration file
 wal_compression             | on    | configuration file
 work_mem                    | 1GB   | configuration file

I changed at least maintenance_work_mem since I originally wrote, to try to
avoid tempfiles during REINDEX (though I'm not sure it matters, as the
tempfiles are effectively cached and may never actually be written).

It's entirely possible those settings aren't ideal.  The server has 72GB RAM.
There are usually very few (typically n<3, but at most a handful) nontrivial
queries running at once, if at all.
I wouldn't expect any data that's not recent (table data from the last 2 days,
or index data from this month) to be cached, and wouldn't expect that to be
entirely cached, either:

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_2016_05_..$';
gb | 425.783050537109

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_2016_05_...*idx';
gb | 60.0909423828125

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_201605.*idx';
gb | 4.85528564453125

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE relname~'_201605$';
gb | 86.8688049316406

As a test, I did SET effective_cache_size='1MB' before running explain, and it still does:

|->  Index Scan using cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_29  (cost=0.44..1526689.49 rows=8342796 width=355)
|      Index Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))

I set enable_indexscan=0, and got:

|->  Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_29  (cost=168006.10..4087526.04 rows=8342796 width=355)
|      Recheck Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 00:00:00'::timestamp without time zone))
|      ->  Bitmap Index Scan on cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx  (cost=0.00..165920.40 rows=8342796 width=0)
|            Index Cond: ((recordopeningtime >= '2016-05-29 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30
Re: [PERFORM] index fragmentation on insert-only table with non-unique column
On Tue, May 24, 2016 at 11:23:48PM -0700, Jeff Janes wrote:
> > But note the non-uniqueness of the index column:
> > ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE
> > recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC;
> >   recordopeningtime  | count
> > ---------------------+-------
> >  2016-05-21 12:17:29 |   176
> >  2016-05-21 12:17:25 |   171
> >  2016-05-21 13:11:33 |   170
> >  2016-05-21 10:20:02 |   169
> >  2016-05-21 11:30:02 |   167
> > [...]
>
> That is not that much duplication.  You aren't going to have dozens or
> hundreds of leaf pages all with equal values.  (and you only showed
> the most highly duplicated ones, presumably the average is much less)

Point taken, but it's not that great of a range either:

ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 LIMIT 19;
  recordopeningtime  | count
---------------------+-------
 2016-05-21 03:10:05 |    44
 2016-05-21 03:55:05 |    44
 2016-05-21 04:55:05 |    45

ts=# SELECT count(distinct recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22';
-[ RECORD 1 ]
count | 86400

ts=# SELECT count(recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22';
-[ RECORD 1 ]--
count | 8892865

> > We have a daily analytic query which processes the previous day's data.  For
> > new child tables, with only 1 day's data loaded, this runs in ~30min, and for
> > child tables with an entire week's worth of data loaded, takes several hours
> > (even though both queries process the same amount of data).
>
> For an append only table, why would the first day of a new partition
> be any less fragmented than that same day would be a week from now?
> Are you sure it isn't just that your week-old data has all been aged
> out of the cache?
I don't think it's a cache effect, since we're not using the source table for
much (maybe anything) else the entire rest of the day.  The server has 72GB
RAM, the same size as the largest of the tables being joined (beginning) at
4am.

I didn't mean that a given day is more fragmented now than it was last week
(but I don't know, either).  I guess when we do a query on the table with ~32
hours of data in it, it might do a seq scan rather than an index scan, too.

Compare the end of month partition tables:

ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_02_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_03_29_recordopeningtime_idx');
leaf_fragmentation | 48.38
ts=# select * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_04_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_04_22_recordopeningtime_idx');
leaf_fragmentation | 48.66
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_03_22_recordopeningtime_idx');
leaf_fragmentation | 48.27
ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_02_22_recordopeningtime_idx');
leaf_fragmentation | 48

This one I reindexed as a test:

ts=# SELECT * FROM pgstatindex('cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx');
leaf_fragmentation | 0.01

.. and the query ran in ~30min (reran a 2nd time, with cache effects: 25min).

> > First, I found I was able to get 30-50min query results on full week's table by
> > preferring a seq scan to an index scan.  The row estimates seemed fine, and the
> > only condition is the timestamp, so the planner's use of index scan is as
> > expected.
>
> Can you show us the query?  I would expect a bitmap scan of the index
> (which would do what you want, but even more so), instead.

See explain, also showing additional tables/views being joined.
It's NOT doing a bitmap scan though, and I'd be interested to find why; I'm
sure that would've improved this query enough so it never would've been an
issue.
https://explain.depesz.com/s/s8KP

->  Index Scan using cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on cdrs_huawei_pgwrecord_2016_05_01  (cost=0.56..1601734.57 rows=8943848 width=349)
      Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-08 00:00:00'::timestamp without time zone))

> > AFAICT what's happening is that the index scan was returning pages
> > nonsequentially.  strace-ing the backend showed alternating lseek()s and
> > read()s, with the offsets not consistently increasing (nor consistently
> > decreasing): ..
>
> Which of those are the table, and which the index?

Those weren't necessarily straces of the same process; I believe both of these
were table data/heap, and didn't include any index access.

> Something doesn't add up here.  How could an index of an append-only
> table possibly become that
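The one-index-at-a-time pgstatindex() calls above can be collapsed into a
single survey of every partition's timestamp index.  A sketch, assuming the
pgstattuple extension is installed and the indexes all match the naming pattern
from this thread (on 9.5 pgstatindex() takes the index name as text; schema
qualification is omitted here for brevity):

```sql
-- Survey leaf_fragmentation across all partition indexes at once.
-- Requires: CREATE EXTENSION pgstattuple;
SELECT c.relname,
       s.leaf_fragmentation
FROM pg_class c
CROSS JOIN LATERAL pgstatindex(c.relname::text) s
WHERE c.relkind = 'i'
  AND c.relname ~ '^cdrs_huawei_pgwrecord_.*_recordopeningtime_idx$'
ORDER BY s.leaf_fragmentation DESC;
```

Sorting by leaf_fragmentation descending makes the worst candidates for
reindexing surface first; note pgstatindex() reads the whole index, so this is
not free on ~GB-sized indexes.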
Re: [PERFORM] index fragmentation on insert-only table with non-unique column
On Tue, May 24, 2016 at 09:16:20PM -0700, Peter Geoghegan wrote:
> On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pry...@telsasoft.com> wrote:
> > Postgres seems to assume that the high degree of correlation of the table
> > column seen in pg_stats is how it will get data from the index scan, which
> > assumption seems to be very poor on what turns out to be a highly fragmented
> > index.  Is there a way to help it to understand otherwise?
>
> Your complaint is vague.  Are you complaining about the planner making
> a poor choice?  I don't think that's the issue here, because you never
> made any firm statement about the planner making a choice that was
> worse than an alternative that it had available.

I was thinking there are a few possible places to make improvements: the
planner could understand that scans of non-unique indices don't result in
strictly sequential scans of the table, the degree of non-sequentialness being
determined by the column statistics, and perhaps by properties of the index
itself.  Or the INSERT code or btree scan could improve on this, even if
tuples aren't fully ordered.

> If you're arguing for the idea that B-Trees should reliably keep
> tuples in order by a tie-break condition, that seems difficult to
> implement, and likely not worth it in practice.

I had the awful idea to change the index to use (recordopeningtime,ctid).
Maybe somebody will convince me otherwise, but it may actually work better
than trying to reindex this table daily by 4am.

Thanks,
Justin

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] index fragmentation on insert-only table with non-unique column
Summary: Non-unique btree indices are returning CTIDs for rows with the same
value of the indexed column not in logical order, imposing a high performance
penalty.

Running PG 9.5.3 now, we have time-based partitions of append-only tables with
data loaded from other sources.  The tables are partitioned by time, and the
timestamp column has a non-unique, not-null btree index.  The child tables are
each ~75GB and expected to keep growing.  For a child table with a week's
worth of data:

relpages  | 11255802
reltuples | 5.90502e+07

The data is loaded shortly after it's available, so it has high correlation in
pg_statistic:

[pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct FROM pg_stats s JOIN pg_class c ON (c.relname=s.tablename) WHERE tablename LIKE 'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" |head
            tablename             | correlation | n_distinct
----------------------------------+-------------+------------
 cdrs_huawei_pgwrecord            |        0.97 |     102892
 cdrs_huawei_pgwrecord_2016_02_15 |    0.999658 |      96145
 cdrs_huawei_pgwrecord_2016_02_22 |        0.43 |      91916
 cdrs_huawei_pgwrecord_2016_02_29 |    0.997219 |      50341
 cdrs_huawei_pgwrecord_2016_03_01 |        0.47 |      97485

But note the non-uniqueness of the index column:

ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC;
  recordopeningtime  | count
---------------------+-------
 2016-05-21 12:17:29 |   176
 2016-05-21 12:17:25 |   171
 2016-05-21 13:11:33 |   170
 2016-05-21 10:20:02 |   169
 2016-05-21 11:30:02 |   167
[...]

We have a daily analytic query which processes the previous day's data.  For
new child tables, with only 1 day's data loaded, this runs in ~30min, and for
child tables with an entire week's worth of data loaded, it takes several
hours (even though both queries process the same amount of data).

First, I found I was able to get 30-50min query results on a full week's table
by preferring a seq scan to an index scan.
The row estimates seemed fine, and the only condition is the timestamp, so the
planner's use of an index scan is as expected.

AFAICT what's happening is that the index scan was returning pages
nonsequentially.  strace-ing the backend showed alternating lseek()s and
read()s, with the offsets not consistently increasing (nor consistently
decreasing):

% sudo strace -p 25588 2>&1 |grep -m9 'lseek(773'
lseek(773, 1059766272, SEEK_SET) = 1059766272
lseek(773, 824926208, SEEK_SET)  = 824926208
lseek(773, 990027776, SEEK_SET)  = 990027776
lseek(773, 990330880, SEEK_SET)  = 990330880
lseek(773, 1038942208, SEEK_SET) = 1038942208
lseek(773, 1059856384, SEEK_SET) = 1059856384
lseek(773, 977305600, SEEK_SET)  = 977305600
lseek(773, 990347264, SEEK_SET)  = 990347264
lseek(773, 871096320, SEEK_SET)  = 871096320

.. and consecutive read()s being rare:

read(802, "g"..., 8192)         = 8192
lseek(802, 918003712, SEEK_SET) = 918003712
read(802, "c"..., 8192)         = 8192
lseek(802, 859136000, SEEK_SET) = 859136000
read(802, "a"..., 8192)         = 8192
lseek(802, 919601152, SEEK_SET) = 919601152
read(802, "d"..., 8192)         = 8192
lseek(802, 905101312, SEEK_SET) = 905101312
read(802, "c"..., 8192)         = 8192
lseek(801, 507863040, SEEK_SET) = 507863040
read(801, "p"..., 8192)         = 8192
lseek(802, 914235392, SEEK_SET) = 914235392
read(802, "c"..., 8192)         = 8192

I was able to see great improvement, without planner parameters, by REINDEXing
the timestamp index.  My theory is that the index/planner doesn't handle well
the case of many tuples with the same column value, and returns pages out of
logical order.  Reindex fixes that, rewriting the index data with pages in
order (confirmed with pageinspect), which causes index scans to fetch heap
data more or less monotonically (if not consecutively).  strace shows that
consecutive read()s are common (without intervening seeks).  I gather this
allows the OS readahead to kick in.
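For reference, the pageinspect check mentioned above can be done by listing the
heap pointers (ctid) stored on a leaf page of the index.  A sketch, assuming
the pageinspect extension; the block number 1 here is arbitrary (block 0 is the
btree metapage, and on a large index you'd pick a known leaf page):

```sql
-- Requires: CREATE EXTENSION pageinspect;
-- After REINDEX, the heap block numbers in ctid should be (mostly)
-- monotonic within a page; on the fragmented index they jump around.
SELECT itemoffset, ctid
FROM bt_page_items('cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx', 1)
LIMIT 20;
```

The first component of each ctid is the heap block number, so a quick scan of
the output shows whether an in-order index walk translates to sequential heap
fetches.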
Postgres seems to assume that the high degree of correlation of the table
column seen in pg_stats is how it will get data from the index scan, which
assumption seems to be very poor on what turns out to be a highly fragmented
index.  Is there a way to help it to understand otherwise?

Maybe this is a well-understood problem/deficiency; but, is there any reason
why a btree scan can't sort results by ctid for index tuples with the same
column value (_bt_steppage() or btgettuple())?

Or maybe the problem could be mitigated by changing the behavior during
INSERT?  In the meantime, I'll be implementing a reindex job.

Thanks,
Justin
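Since plain REINDEX takes an exclusive lock on the table (and REINDEX
CONCURRENTLY doesn't exist until much later releases), the reindex job
mentioned above would likely need the build-and-swap pattern.  A sketch using
the index name from this thread; the `_new` suffix is made up, and CREATE
INDEX CONCURRENTLY cannot run inside a transaction block:

```sql
-- Build a fresh, unfragmented copy of the index without blocking writers,
-- then swap it in place of the old one.
CREATE INDEX CONCURRENTLY cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx_new
    ON cdrs_huawei_pgwrecord_2016_05_01 (recordopeningtime);

DROP INDEX cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx;

ALTER INDEX cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx_new
    RENAME TO cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx;
```

This only works for a plain index as here; an index backing a constraint would
need ALTER TABLE .. ADD CONSTRAINT .. USING INDEX instead of the rename.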
Re: [PERFORM] Clarification on using pg_upgrade
On Fri, Mar 04, 2016 at 02:27:59PM -0800, Tory M Blue wrote:
> If my data is located in /data
>
> and I link to a new dir in /data1, what actually happens.  do I end up with
> 2 file systems and links and thus am not able to delete or cleanup any old
> data, or how does this work?
>
> Also will the reindex creation still happen with this type of in-place
> upgrade, as if so, then it may not save too much time vs a dump/import.

Since you have the space, you can do a test upgrade; make a dump of the
essential tables (or the entire thing) and restore it to another instance,
perhaps even something run from your /home.

pg_upgrade --link makes hardlinks for tables and indices (same as cp -l), so
it uses very little additional space.  Note, that means that both must be
within the same filesystem (/data).  You should understand hardlinks and
inodes, otherwise this will lead to confusion and mistakes.  Indexes don't
need to be rebuilt afterwards.

I've upgraded ~35 customers to 9.5 already, some as big as 5TB.  So far the
disruption has been at most 30min (not counting ANALYZE afterwards).

When I use pg_upgrade, after stopping the old instance, I rename the data dir
(under centos, /var/lib/pgsql/9.4~).  Then pg_upgrade makes links in 9.5/.
Renaming has the advantage that the old instance can't be accidentally
started; and, it makes it much easier to believe that it's safe to remove the
9.4~ afterwards.

Justin
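The hardlink behavior described above can be demonstrated without pg_upgrade
at all.  A minimal sketch (the filenames are made up, and `stat -c` is GNU
coreutils syntax, i.e. Linux):

```shell
#!/bin/sh
# Demonstrate what pg_upgrade --link does at the filesystem level:
# a hardlink is a second directory entry for the same inode, so no data
# is copied, and both names must live on the same filesystem.
set -e
dir=$(mktemp -d)
echo "relation data" > "$dir/16384"     # stand-in for a relation file
ln "$dir/16384" "$dir/16384.new"        # same operation as cp -l

# Both names point at the same inode:
[ "$(stat -c %i "$dir/16384")" = "$(stat -c %i "$dir/16384.new")" ] && echo "same inode"

# Link count is now 2, so removing ONE name does not free the data:
stat -c %h "$dir/16384"
rm "$dir/16384"
cat "$dir/16384.new"                    # prints: relation data

rm -r "$dir"
```

This is why removing the old, renamed data directory after the upgrade is
safe: it only drops one set of names, and the new cluster's links keep the
relation files alive.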
[PERFORM] dump time increase by 1h with new kernel
[I got no response on -general for a few days so I'm trying here]

When we upgraded from linux-2.6.24 to linux-2.6.27, our pg_dump duration
increased by 20%, from 5 hours to 6.  My first attempt at resolution was to
boot with elevator=deadline.  However that's actually the default IO scheduler
in both kernels.

The two dmesg's are at:
https://www.norchemlab.com/tmp/linux-2.6.24-22.45-server
https://www.norchemlab.com/tmp/linux-2.6.27-14.41-server

The database partition is: xfs / lvm / aic79xx / scsi.

Booting back into the .24 kernel brings the pg_dump back down to 5 hours (for
daily 20GB output compressed by pg_dump -Fc).

Does anyone know what might be different which could cause such a drastic
change?

Thanks,
Justin