Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
On Fri, Mar 24, 2017 at 11:06 PM, Simon Riggs wrote: > On 1 March 2017 at 01:36, Amit Langote wrote: > >> I don't know which way you're thinking of fixing this, but a planner patch >> to implement faster partition-pruning will have taken care of this, I >> think. As you may know, even declarative partitioned tables currently >> depend on constraint exclusion for partition-pruning and planner's current >> approach of handling inheritance requires to open all the child tables >> (partitions), whereas the new approach hopefully shouldn't need to do >> that. I am not sure if looking for a more localized fix for this would be >> worthwhile, although I may be wrong. > > What "new approach" are we discussing? > > Is there a patch or design discussion? Neither at the moment. As Aleksander said in his reply I was referring to Dmitry Ivanov's plan of porting pg_pathman's planner functionality to core that he mentioned on the declarative partitioning thread back in December [1]. Thanks, Amit [1] https://www.postgresql.org/message-id/426b2b01-61e0-43aa-bd84-c6fcf516f1c3%40postgrespro.ru -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Simon, > > I don't know which way you're thinking of fixing this, but a planner patch > > to implement faster partition-pruning will have taken care of this, I > > think. As you may know, even declarative partitioned tables currently > > depend on constraint exclusion for partition-pruning and planner's current > > approach of handling inheritance requires to open all the child tables > > (partitions), whereas the new approach hopefully shouldn't need to do > > that. I am not sure if looking for a more localized fix for this would be > > worthwhile, although I may be wrong. > > What "new approach" are we discussing? > Is there a patch or design discussion? I think what was meant was plans of my colleague Dmitry Ivanov to implement partition-pruning. I've just spoke with Dmitry about this matter. Unless there is anyone else who is already working on this optimization we would like to work on it together. Unfortunately there is no patch or design discussion of partition-pruning on this commitfest. -- Best regards, Aleksander Alekseev signature.asc Description: PGP signature
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
On 1 March 2017 at 01:36, Amit Langote wrote: > I don't know which way you're thinking of fixing this, but a planner patch > to implement faster partition-pruning will have taken care of this, I > think. As you may know, even declarative partitioned tables currently > depend on constraint exclusion for partition-pruning and planner's current > approach of handling inheritance requires to open all the child tables > (partitions), whereas the new approach hopefully shouldn't need to do > that. I am not sure if looking for a more localized fix for this would be > worthwhile, although I may be wrong. What "new approach" are we discussing? Is there a patch or design discussion? -- Simon Riggshttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Tels, Thanks a lot for the review! > "corresponding" Fixed. > Also a question: Some one-line comments are > > /* Comment. */ > > while others are > > /* > * Comment. > */ > > Why the difference? I'm trying to follow a code stile used in a code I'm modifying. In this case I got an impression that second style of one-line comments is used more often an I tried to follow it but apparently I didn't succeed :) This is fixed as well. On Thu, Mar 09, 2017 at 06:40:39PM -0500, Tels wrote: > Hi Aleksander, > > noticed this in your patch: > > > * Add a corespinding entry to pgStatTabHash. > > "corresponding" > > Also a question: Some one-line comments are > > /* Comment. */ > > while others are > > /* > * Comment. > */ > > Why the difference? > > Hope this helps, > > Tels -- Best regards, Aleksander Alekseev diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 7cacb1e9b2..a22a5a37c8 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -161,6 +161,20 @@ typedef struct TabStatusArray static TabStatusArray *pgStatTabList = NULL; /* + * pgStatTabHash entry + */ +typedef struct TabStatHashEntry +{ + Oid t_id; + PgStat_TableStatus* tsa_entry; +} TabStatHashEntry; + +/* + * Hash table for O(1) t_id -> tsa_entry lookup + */ +static HTAB *pgStatTabHash = NULL; + +/* * Backends store per-function info that's waiting to be sent to the collector * in this hash table (indexed by function OID). */ @@ -815,7 +829,13 @@ pgstat_report_stat(bool force) pgstat_send_tabstat(this_msg); this_msg->m_nentries = 0; } + + /* + * Entry will be zeroed soon so we need to remove it from the lookup table. + */ + hash_search(pgStatTabHash, &entry->t_id, HASH_REMOVE, NULL); } + /* zero out TableStatus structs after use */ MemSet(tsa->tsa_entries, 0, tsa->tsa_used * sizeof(PgStat_TableStatus)); @@ -1667,59 +1687,77 @@ pgstat_initstats(Relation rel) } /* + * Make sure pgStatTabList and pgStatTabHash are initialized. + */ +static void +make_sure_stat_tab_initialized() +{ + HASHCTL ctl; + + if (pgStatTabList) + return; + + Assert(!pgStatTabHash); + + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(TabStatHashEntry); + ctl.hcxt = TopMemoryContext; + + pgStatTabHash = hash_create("pgstat t_id to tsa_entry lookup table", + TABSTAT_QUANTUM, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); + + pgStatTabList = (TabStatusArray *) MemoryContextAllocZero(TopMemoryContext, +sizeof(TabStatusArray)); +} + +/* * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel */ static PgStat_TableStatus * get_tabstat_entry(Oid rel_id, bool isshared) { + TabStatHashEntry* hash_entry; PgStat_TableStatus *entry; TabStatusArray *tsa; - TabStatusArray *prev_tsa; - int i; + bool found; + + make_sure_stat_tab_initialized(); + + /* + * Find an entry or create a new one. + */ + hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found); + if(found) + return hash_entry->tsa_entry; /* - * Search the already-used tabstat slots for this relation. + * `hash_entry` was just created and now we have to fill it. + * First make sure there is a free space in a last element of pgStatTabList. */ - prev_tsa = NULL; - for (tsa = pgStatTabList; tsa != NULL; prev_tsa = tsa, tsa = tsa->tsa_next) + tsa = pgStatTabList; + while(tsa->tsa_used == TABSTAT_QUANTUM) { - for (i = 0; i < tsa->tsa_used; i++) + if(tsa->tsa_next == NULL) { - entry = &tsa->tsa_entries[i]; - if (entry->t_id == rel_id) -return entry; + tsa->tsa_next = (TabStatusArray *) MemoryContextAllocZero(TopMemoryContext, + sizeof(TabStatusArray)); } - if (tsa->tsa_used < TABSTAT_QUANTUM) - { - /* - * It must not be present, but we found a free slot instead. Fine, - * let's use this one. We assume the entry was already zeroed, - * either at creation or after last use. - */ - entry = &tsa->tsa_entries[tsa->tsa_used++]; - entry->t_id = rel_id; - entry->t_shared = isshared; - return entry; - } + tsa = tsa->tsa_next; } /* - * We ran out of tabstat slots, so allocate more. Be sure they're zeroed. - */ - tsa = (TabStatusArray *) MemoryContextAllocZero(TopMemoryContext, - sizeof(TabStatusArray)); - if (prev_tsa) - prev_tsa->tsa_next = tsa; - else - pgStatTabList = tsa; - - /* - * Use the first entry of the new TabStatusArray. + * Add an entry. */ entry = &tsa->tsa_entries[tsa->tsa_used++]; entry->t_id = rel_id; entry->t_shared = isshared; + + /* + * Add a corresponding entry to pgStatTabHash. + */ + hash_entry->tsa_entry = entry; return entry; } @@ -1731,22 +1769,19 @@ get_tabstat_entry(Oid rel_id, bool isshared) PgStat_TableStatus * find_tabstat_entry(Oid rel_id) { - PgStat_TableStatus *entry; - TabStatusArray *tsa; - int i; + TabStatHashEntry* hash_entry; - for (tsa = pgStatTabList; tsa
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Aleksander, noticed this in your patch: > * Add a corespinding entry to pgStatTabHash. "corresponding" Also a question: Some one-line comments are /* Comment. */ while others are /* * Comment. */ Why the difference? Hope this helps, Tels PS: Sorry if this appears twice, I fumbled the From: ... -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Aleksander, noticed this in your patch: > * Add a corespinding entry to pgStatTabHash. "corresponding" Also a question: Some one-line comments are /* Comment. */ while others are /* * Comment. */ Why the difference? Hope this helps, Tels -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Amit, > Sorry, I didn't mean to dissuade you from trying those > micro-optimizations. If general inheritance cases can benefit from it > (which, until we have a different method, will be used by partitioned > tables as well), I think we should try it. OK, I'll see what could be done here as well then. On Tue, Mar 07, 2017 at 10:55:12AM +0900, Amit Langote wrote: > Hi Aleksander, > > On 2017/03/07 0:22, Aleksander Alekseev wrote: > > Hello. > > > > OK, here is a patch. > > > > Benchmark, before: > > > > ``` > > number of transactions actually processed: 1823 > > latency average = 1153.495 ms > > latency stddev = 154.366 ms > > tps = 6.061104 (including connections establishing) > > tps = 6.061211 (excluding connections establishing) > > ``` > > > > Benchmark, after: > > > > ``` > > number of transactions actually processed: 2462 > > latency average = 853.862 ms > > latency stddev = 112.270 ms > > tps = 8.191861 (including connections establishing) > > tps = 8.192028 (excluding connections establishing) > > ``` > > > > +35% TPS, just as expected. Feel free to run your own benchmarks on > > different datasets and workloads. `perf top` shows that first bottleneck > > is completely eliminated. > > That seems like a good gain. > > > I did nothing about the second bottleneck > > since as Amit mentioned partition-pruning should solve this anyway and > > apparently any micro-optimizations don't worth an effort. > > Sorry, I didn't mean to dissuade you from trying those > micro-optimizations. If general inheritance cases can benefit from it > (which, until we have a different method, will be used by partitioned > tables as well), I think we should try it. > > Thanks, > Amit > > -- Best regards, Aleksander Alekseev signature.asc Description: PGP signature
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi, Andres Thanks a lot for the review! > Why are we keeping that list / the "batch" allocation pattern? It > doesn't actually seem useful to me after these changes. Given that we > don't shrink hash-tables, we could just use the hash-table memory for > this, no? I don't think we can do that. According to comments: ``` * NOTE: once allocated, TabStatusArray structures are never moved or deleted [...] * This allows relcache pgstat_info pointers to be treated as long-lived data, * avoiding repeated searches in pgstat_initstats() when a relation is * repeatedly opened during a transaction. ``` It is my understanding that dynahash can't give same guarantees. > 'faster ... lookup' doesn't strike me as a useful comment, because it's > only useful in relation of the current code to the new code. Agree, fixed to 'O(1) ... lookup'. > It's not really freed... Right, it's actually zeroed. Fixed. > Arguably it'd be better to HASH_ENTER directly here, instead of doing > two lookups. Good point. Fixed. > Think it'd be better to move the hash creation into its own function. Agree, fixed. On Mon, Mar 06, 2017 at 12:01:50PM -0800, Andres Freund wrote: > Hi, > > This issue has bothered me in non-partitioned use-cases recently, so > thanks for taking it up. > > > On 2017-03-06 18:22:17 +0300, Aleksander Alekseev wrote: > > diff --git a/src/backend/postmaster/pgstat.c > > b/src/backend/postmaster/pgstat.c > > index 2fb9a8bf58..fa906e7930 100644 > > --- a/src/backend/postmaster/pgstat.c > > +++ b/src/backend/postmaster/pgstat.c > > @@ -160,6 +160,16 @@ typedef struct TabStatusArray > > > > static TabStatusArray *pgStatTabList = NULL; > > Why are we keeping that list / the "batch" allocation pattern? It > doesn't actually seem useful to me after these changes. Given that we > don't shrink hash-tables, we could just use the hash-table memory for > this, no? > > I think a separate list that only contains things that are "active" in > the current transaction makes sense, because the current logic requires > iterating over a potentially very large array at transaction commit. > > > > +/* pgStatTabHash entry */ > > +typedef struct TabStatHashEntry > > +{ > > + Oid t_id; > > + PgStat_TableStatus* tsa_entry; > > +} TabStatHashEntry; > > + > > +/* Hash table for faster t_id -> tsa_entry lookup */ > > +static HTAB *pgStatTabHash = NULL; > > + > > 'faster ... lookup' doesn't strike me as a useful comment, because it's > only useful in relation of the current code to the new code. > > > > > /* > > * Backends store per-function info that's waiting to be sent to the > > collector > > * in this hash table (indexed by function OID). > > @@ -815,7 +825,13 @@ pgstat_report_stat(bool force) > > pgstat_send_tabstat(this_msg); > > this_msg->m_nentries = 0; > > } > > + > > + /* > > +* Entry will be freed soon so we need to remove it > > from the lookup table. > > +*/ > > + hash_search(pgStatTabHash, &entry->t_id, HASH_REMOVE, > > NULL); > > } > > It's not really freed... > > > > static PgStat_TableStatus * > > get_tabstat_entry(Oid rel_id, bool isshared) > > { > > + TabStatHashEntry* hash_entry; > > PgStat_TableStatus *entry; > > TabStatusArray *tsa; > > - TabStatusArray *prev_tsa; > > - int i; > > + > > + /* Try to find an entry */ > > + entry = find_tabstat_entry(rel_id); > > + if(entry != NULL) > > + return entry; > > Arguably it'd be better to HASH_ENTER directly here, instead of doing > two lookups. > > > > /* > > -* Search the already-used tabstat slots for this relation. > > +* Entry doesn't exist - creating one. > > +* First make sure there is a free space in a last element of > > pgStatTabList. > > */ > > - prev_tsa = NULL; > > - for (tsa = pgStatTabList; tsa != NULL; prev_tsa = tsa, tsa = > > tsa->tsa_next) > > + if (!pgStatTabList) > > { > > - for (i = 0; i < tsa->tsa_used; i++) > > - { > > - entry = &tsa->tsa_entries[i]; > > - if (entry->t_id == rel_id) > > - return entry; > > - } > > + HASHCTL ctl; > > > > - if (tsa->tsa_used < TABSTAT_QUANTUM) > > + Assert(!pgStatTabHash); > > + > > + memset(&ctl, 0, sizeof(ctl)); > > + ctl.keysize = sizeof(Oid); > > + ctl.entrysize = sizeof(TabStatHashEntry); > > + ctl.hcxt = TopMemoryContext; > > + > > + pgStatTabHash = hash_create("pgstat t_id to tsa_entry lookup > > table", > > + TABSTAT_QUANTUM, &ctl, HASH_ELEM | HASH_BLOBS | > > HASH_CONTEXT); > > Think it'd be better to move the hash creation into its own function. > > > - Andres -- Best regards, Aleksander Alekseev diff --git a/src/backend/postma
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi Aleksander, On 2017/03/07 0:22, Aleksander Alekseev wrote: > Hello. > > OK, here is a patch. > > Benchmark, before: > > ``` > number of transactions actually processed: 1823 > latency average = 1153.495 ms > latency stddev = 154.366 ms > tps = 6.061104 (including connections establishing) > tps = 6.061211 (excluding connections establishing) > ``` > > Benchmark, after: > > ``` > number of transactions actually processed: 2462 > latency average = 853.862 ms > latency stddev = 112.270 ms > tps = 8.191861 (including connections establishing) > tps = 8.192028 (excluding connections establishing) > ``` > > +35% TPS, just as expected. Feel free to run your own benchmarks on > different datasets and workloads. `perf top` shows that first bottleneck > is completely eliminated. That seems like a good gain. > I did nothing about the second bottleneck > since as Amit mentioned partition-pruning should solve this anyway and > apparently any micro-optimizations don't worth an effort. Sorry, I didn't mean to dissuade you from trying those micro-optimizations. If general inheritance cases can benefit from it (which, until we have a different method, will be used by partitioned tables as well), I think we should try it. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi, This issue has bothered me in non-partitioned use-cases recently, so thanks for taking it up. On 2017-03-06 18:22:17 +0300, Aleksander Alekseev wrote: > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c > index 2fb9a8bf58..fa906e7930 100644 > --- a/src/backend/postmaster/pgstat.c > +++ b/src/backend/postmaster/pgstat.c > @@ -160,6 +160,16 @@ typedef struct TabStatusArray > > static TabStatusArray *pgStatTabList = NULL; Why are we keeping that list / the "batch" allocation pattern? It doesn't actually seem useful to me after these changes. Given that we don't shrink hash-tables, we could just use the hash-table memory for this, no? I think a separate list that only contains things that are "active" in the current transaction makes sense, because the current logic requires iterating over a potentially very large array at transaction commit. > +/* pgStatTabHash entry */ > +typedef struct TabStatHashEntry > +{ > + Oid t_id; > + PgStat_TableStatus* tsa_entry; > +} TabStatHashEntry; > + > +/* Hash table for faster t_id -> tsa_entry lookup */ > +static HTAB *pgStatTabHash = NULL; > + 'faster ... lookup' doesn't strike me as a useful comment, because it's only useful in relation of the current code to the new code. > /* > * Backends store per-function info that's waiting to be sent to the > collector > * in this hash table (indexed by function OID). > @@ -815,7 +825,13 @@ pgstat_report_stat(bool force) > pgstat_send_tabstat(this_msg); > this_msg->m_nentries = 0; > } > + > + /* > + * Entry will be freed soon so we need to remove it > from the lookup table. > + */ > + hash_search(pgStatTabHash, &entry->t_id, HASH_REMOVE, > NULL); > } It's not really freed... > static PgStat_TableStatus * > get_tabstat_entry(Oid rel_id, bool isshared) > { > + TabStatHashEntry* hash_entry; > PgStat_TableStatus *entry; > TabStatusArray *tsa; > - TabStatusArray *prev_tsa; > - int i; > + > + /* Try to find an entry */ > + entry = find_tabstat_entry(rel_id); > + if(entry != NULL) > + return entry; Arguably it'd be better to HASH_ENTER directly here, instead of doing two lookups. > /* > - * Search the already-used tabstat slots for this relation. > + * Entry doesn't exist - creating one. > + * First make sure there is a free space in a last element of > pgStatTabList. >*/ > - prev_tsa = NULL; > - for (tsa = pgStatTabList; tsa != NULL; prev_tsa = tsa, tsa = > tsa->tsa_next) > + if (!pgStatTabList) > { > - for (i = 0; i < tsa->tsa_used; i++) > - { > - entry = &tsa->tsa_entries[i]; > - if (entry->t_id == rel_id) > - return entry; > - } > + HASHCTL ctl; > > - if (tsa->tsa_used < TABSTAT_QUANTUM) > + Assert(!pgStatTabHash); > + > + memset(&ctl, 0, sizeof(ctl)); > + ctl.keysize = sizeof(Oid); > + ctl.entrysize = sizeof(TabStatHashEntry); > + ctl.hcxt = TopMemoryContext; > + > + pgStatTabHash = hash_create("pgstat t_id to tsa_entry lookup > table", > + TABSTAT_QUANTUM, &ctl, HASH_ELEM | HASH_BLOBS | > HASH_CONTEXT); Think it'd be better to move the hash creation into its own function. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hello. OK, here is a patch. Benchmark, before: ``` number of transactions actually processed: 1823 latency average = 1153.495 ms latency stddev = 154.366 ms tps = 6.061104 (including connections establishing) tps = 6.061211 (excluding connections establishing) ``` Benchmark, after: ``` number of transactions actually processed: 2462 latency average = 853.862 ms latency stddev = 112.270 ms tps = 8.191861 (including connections establishing) tps = 8.192028 (excluding connections establishing) ``` +35% TPS, just as expected. Feel free to run your own benchmarks on different datasets and workloads. `perf top` shows that first bottleneck is completely eliminated. I did nothing about the second bottleneck since as Amit mentioned partition-pruning should solve this anyway and apparently any micro-optimizations don't worth an effort. Also I checked test coverage using lcov to make sure that this code is test covered. An exact script I'm using could be found here [1]. [1] https://github.com/afiskon/pgscripts/blob/master/code-coverage.sh On Wed, Mar 01, 2017 at 10:36:24AM +0900, Amit Langote wrote: > Hi, > > On 2017/02/28 23:25, Aleksander Alekseev wrote: > > Hello. > > > > I decided to figure out whether current implementation of declarative > > partitioning has any bottlenecks when there is a lot of partitions. Here > > is what I did [1]. > > Thanks for sharing. > > > Then: > > > > ``` > > # 2580 is some pk that exists > > echo 'select * from part_test where pk = 2580;' > t.sql > > pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax > > ``` > > > > `perf top` showed to bottlenecks [2]. A stacktrace for the first one > > looks like this [3]: > > > > ``` > > 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') > > at pgstat.c:1689 > > 1689if (entry->t_id == rel_id) > > #0 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 > > '\000') at pgstat.c:1689 > > #1 0x007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at > > pgstat.c:1666 > > #2 0x004c7090 in relation_open (relationId=25696, lockmode=0) at > > heapam.c:1137 > > #3 0x004c72c9 in heap_open (relationId=25696, lockmode=0) at > > heapam.c:1291 > > (skipped) > > ``` > > > > And here is a stacktrace for the second bottleneck [4]: > > > > ``` > > 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, > > numparents=0x0) at pg_inherits.c:199 > > 199 forboth(lo, rels_list, li, rel_numparents) > > #0 0x00584fb1 in find_all_inheritors (parentrelId=16393, > > lockmode=1, numparents=0x0) at pg_inherits.c:199 > > #1 0x0077fc9f in expand_inherited_rtentry (root=0x1badcb8, > > rte=0x1b630b8, rti=1) at prepunion.c:1408 > > #2 0x0077fb67 in expand_inherited_tables (root=0x1badcb8) at > > prepunion.c:1335 > > #3 0x00767526 in subquery_planner (glob=0x1b63cc0, > > parse=0x1b62fa0, parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) > > at planner.c:568 > > (skipped) > > ``` > > > > The first one could be easily fixed by introducing a hash table > > (rel_id -> pgStatList entry). Perhaps hash table should be used only > > after some threshold. Unless there are any objections I will send a > > corresponding patch shortly. > > I have never thought about this one really. > > > I didn't explored the second bottleneck closely yet but at first glance > > it doesn't look much more complicated. > > I don't know which way you're thinking of fixing this, but a planner patch > to implement faster partition-pruning will have taken care of this, I > think. As you may know, even declarative partitioned tables currently > depend on constraint exclusion for partition-pruning and planner's current > approach of handling inheritance requires to open all the child tables > (partitions), whereas the new approach hopefully shouldn't need to do > that. I am not sure if looking for a more localized fix for this would be > worthwhile, although I may be wrong. > > Thanks, > Amit > > -- Best regards, Aleksander Alekseev diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 2fb9a8bf58..fa906e7930 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -160,6 +160,16 @@ typedef struct TabStatusArray static TabStatusArray *pgStatTabList = NULL; +/* pgStatTabHash entry */ +typedef struct TabStatHashEntry +{ + Oid t_id; + PgStat_TableStatus* tsa_entry; +} TabStatHashEntry; + +/* Hash table for faster t_id -> tsa_entry lookup */ +static HTAB *pgStatTabHash = NULL; + /* * Backends store per-function info that's waiting to be sent to the collector * in this hash table (indexed by function OID). @@ -815,7 +825,13 @@ pgstat_report_stat(bool force) pgstat_send_tabstat(this_msg); this_msg->m_nentries = 0; } + + /* + * Entry will be freed soon so we need to remove it from the lookup table. + */ + hash_search(pgStatTabH
Re: [HACKERS] Declarative partitioning optimization for large amount of partitions
Hi, On 2017/02/28 23:25, Aleksander Alekseev wrote: > Hello. > > I decided to figure out whether current implementation of declarative > partitioning has any bottlenecks when there is a lot of partitions. Here > is what I did [1]. Thanks for sharing. > Then: > > ``` > # 2580 is some pk that exists > echo 'select * from part_test where pk = 2580;' > t.sql > pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax > ``` > > `perf top` showed to bottlenecks [2]. A stacktrace for the first one > looks like this [3]: > > ``` > 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at > pgstat.c:1689 > 1689 if (entry->t_id == rel_id) > #0 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') > at pgstat.c:1689 > #1 0x007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at > pgstat.c:1666 > #2 0x004c7090 in relation_open (relationId=25696, lockmode=0) at > heapam.c:1137 > #3 0x004c72c9 in heap_open (relationId=25696, lockmode=0) at > heapam.c:1291 > (skipped) > ``` > > And here is a stacktrace for the second bottleneck [4]: > > ``` > 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, > numparents=0x0) at pg_inherits.c:199 > 199 forboth(lo, rels_list, li, rel_numparents) > #0 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, > numparents=0x0) at pg_inherits.c:199 > #1 0x0077fc9f in expand_inherited_rtentry (root=0x1badcb8, > rte=0x1b630b8, rti=1) at prepunion.c:1408 > #2 0x0077fb67 in expand_inherited_tables (root=0x1badcb8) at > prepunion.c:1335 > #3 0x00767526 in subquery_planner (glob=0x1b63cc0, parse=0x1b62fa0, > parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) at planner.c:568 > (skipped) > ``` > > The first one could be easily fixed by introducing a hash table > (rel_id -> pgStatList entry). Perhaps hash table should be used only > after some threshold. Unless there are any objections I will send a > corresponding patch shortly. I have never thought about this one really. > I didn't explored the second bottleneck closely yet but at first glance > it doesn't look much more complicated. I don't know which way you're thinking of fixing this, but a planner patch to implement faster partition-pruning will have taken care of this, I think. As you may know, even declarative partitioned tables currently depend on constraint exclusion for partition-pruning and planner's current approach of handling inheritance requires to open all the child tables (partitions), whereas the new approach hopefully shouldn't need to do that. I am not sure if looking for a more localized fix for this would be worthwhile, although I may be wrong. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Declarative partitioning optimization for large amount of partitions
Hello. I decided to figure out whether current implementation of declarative partitioning has any bottlenecks when there is a lot of partitions. Here is what I did [1]. ``` -- init schema \timing on CREATE TABLE part_test (pk int not null, k int, v varchar(128)) PARTITION BY RANGE(pk); do $$ declare i integer; begin for i in 1 .. 1 loop raise notice 'i = %', i; execute ('CREATE TABLE part_test_' || i || ' PARTITION OF part_test FOR VALUES FROM (' || (1 + (i-1)*1000) || ') to (' || ( (i * 1000) + 1) || ');' ); end loop; end $$; -- fill tables with some data do $$ declare i integer; begin for i in 1 .. 100*1000 loop raise notice 'i = %', i; execute ('insert into part_test values ( ceil(random()*(1-1)*1000), ceil(random()*1*1000), || ceil(random()*1*1000) );'); end loop; end $$; ``` Then: ``` # 2580 is some pk that exists echo 'select * from part_test where pk = 2580;' > t.sql pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax ``` `perf top` showed to bottlenecks [2]. A stacktrace for the first one looks like this [3]: ``` 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at pgstat.c:1689 1689if (entry->t_id == rel_id) #0 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at pgstat.c:1689 #1 0x007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at pgstat.c:1666 #2 0x004c7090 in relation_open (relationId=25696, lockmode=0) at heapam.c:1137 #3 0x004c72c9 in heap_open (relationId=25696, lockmode=0) at heapam.c:1291 (skipped) ``` And here is a stacktrace for the second bottleneck [4]: ``` 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, numparents=0x0) at pg_inherits.c:199 199 forboth(lo, rels_list, li, rel_numparents) #0 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, numparents=0x0) at pg_inherits.c:199 #1 0x0077fc9f in expand_inherited_rtentry (root=0x1badcb8, rte=0x1b630b8, rti=1) at prepunion.c:1408 #2 0x0077fb67 in expand_inherited_tables (root=0x1badcb8) at prepunion.c:1335 #3 0x00767526 in subquery_planner (glob=0x1b63cc0, parse=0x1b62fa0, parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) at planner.c:568 (skipped) ``` The first one could be easily fixed by introducing a hash table (rel_id -> pgStatList entry). Perhaps hash table should be used only after some threshold. Unless there are any objections I will send a corresponding patch shortly. I didn't explored the second bottleneck closely yet but at first glance it doesn't look much more complicated. Please don't hesitate to share your thoughts regarding this matter. [1] http://afiskon.ru/s/e3/5f47af9102_benchmark.txt [2] http://afiskon.ru/s/00/2008c4ae66_temp.png [3] http://afiskon.ru/s/23/650f0afc89_stack.txt [4] http://afiskon.ru/s/03/a7e685a4db_stack2.txt -- Best regards, Aleksander Alekseev signature.asc Description: PGP signature