Re: Should we warn against using too many partitions?
I made another pass, hopefully it's useful and not too much of a pain.

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index cce1618fc1..be2ca3be48 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4674,6 +4675,88 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01';
+
+
+  Declarative Partitioning Best Practices
+
+
+The choice of how to partition a table should be made carefully as the
+performance of query planning and execution can be negatively affected by
+poorly made design decisions.

Maybe just "poor design" ?

+partitioned table. WHERE clause items that match and
+are compatible with the partition key can be used to prune away unneeded

remove "away" ?

+requirements for the PRIMARY KEY or a
+UNIQUE constraint. Removal of unwanted data is also
+a factor to consider when planning your partitioning strategy as an entire
+partition can be removed fairly quickly. However, if data that you want

Can we just say "dropped" ?  On my first (re)reading, I briefly thought this was
now referring to "pruning" as "removal".

+to keep exists in that partition then that means having to resort to using
+DELETE instead of removing the partition.
+
+Choosing the target number of partitions by which the table should be
+divided into is also a critical decision to make. Not having enough

Should be: ".. target number .. into which .. should be divided .."

+partitions may mean that indexes remain too large and that data locality
+remains poor which could result in poor cache hit ratios. However,

Change the 2nd remains to "is" and the second poor to "low" ?

+dividing the table into too many partitions can also cause issues.
+Too many partitions can mean slower query planning times and higher memory

s/slower/longer/

+consumption during both query planning and execution. It's also important
+to consider what changes may occur in the future when choosing how to
+partition your table.
+For example, if you choose to have one partition

Remove "when choosing ..."?  Or say:
|When choosing how to partition your table, it's also important to consider
|what changes may occur in the future.

+per customer and you currently have a small number of large customers,
+what will the implications be if in several years you obtain a large
+number of small customers. In this case, it may be better to choose to
+partition by HASH and choose a reasonable number of
+partitions rather than trying to partition by LIST and
+hoping that the number of customers does not increase significantly over
+time.

It's an unusual thing for which to hope :)

+Sub-partitioning can be useful to further divide partitions that are
+expected to become larger than other partitions, although excessive
+sub-partitioning can easily lead to large numbers of partitions and can
+cause the problems mentioned in the preceding paragraph.

cause the SAME problems ?

+It is also important to consider the overhead of partitioning during
+query planning and execution. The query planner is generally able to
+handle partition hierarchies up a few thousand partitions fairly well,
+provided that typical queries prune all but a small number of partitions
+during query planning. Planning times become slower and memory

s/slower/longer/  Hm, maybe say "typical queries ALLOW PRUNING .."

+consumption becomes higher when more partitions remain after the planner
+performs partition pruning. This is particularly true for the

Just say: "remain after planning" ?

+UPDATE and DELETE commands. Also,
+even if most queries are able to prune a large number of partitions during
+query planning, it still may be undesirable to have a large number of

may still ?

+partitions as each partition requires metadata about the partition to be
+stored in each session that touches it. If each session touches a large

stored for ?

+number of partitions over a period of time then the memory consumption for
+this may become significant.
Remove "over a period of time" ?  Add a comma?  Maybe say:
|If each session touches a large number of partitions, then the memory
|overhead may become significant.

+With data warehouse type workloads it can make sense to use a larger
+number of partitions than with an OLTP type workload. Generally, in data
+warehouses, query planning time is less of a concern as the majority of

VAST majority?  Or "essentially all"?  Or ".. query planning time is
insignificant compared to the time spent during query execution."

+processing time is spent during query execution. With either of these two
+types of workload it is important to make the right decisions early as

early COMMA

Justin
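As a side note on the one-partition-per-customer example in the draft text, the LIST-vs-HASH tradeoff can be sketched in a few lines. This is a toy illustration in plain Python, not PostgreSQL's actual hash-partitioning code: the point is only that with HASH the partition count is fixed up front, so growth in the number of customers does not grow the partition count.

```python
from collections import Counter

NPARTS = 32  # chosen up front, as the draft text recommends for HASH

def route(customer_id: int) -> int:
    # HASH partitioning routes a row by hash(key) mod partition count;
    # Python ints hash to themselves, which is enough for a sketch.
    return hash(customer_id) % NPARTS

# One LIST partition per customer would mean 100,000 partitions here;
# HASH keeps it at NPARTS no matter how many customers show up.
per_partition = Counter(route(cid) for cid in range(100_000))
print(len(per_partition), max(per_partition.values()))
```

With a reasonable hash the rows also spread evenly across the fixed set of partitions, which is the other half of the argument for HASH over a LIST scheme that must be extended by hand.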
Re: Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !
Hello Avinash,

> I was testing bloom indexes today. I understand bloom indexes uses bloom
> filters. [...] So the question here is - I assume - number of bits = k.
> Where k is the total number of hash functions used on top of the data that
> needs to validated. Is that correct ? If yes, why do we see the Index 1
> performing better than Index 2 ? Because, the data has to go through more
> hash functions (4 vs 2) in Index 1 than Index 2. So, with Index 1 it should
> take more time. Also, both the indexes have ZERO false positives. Please
> let me know if there is anything simple that i am missing here.

You may have a look at the blog entry about these parameters that I wrote a few years ago:

http://blog.coelho.net/database/2016/12/11/postgresql-bloom-index.html

-- 
Fabien.
Re: [PROPOSAL] Drop orphan temp tables in single-mode
Hello Alexander,

On Friday, June 7, 2019, Alexander Korotkov wrote:
> BTW, does this patch checks that temporary table is really orphan?
> AFAICS, user may define some temporary tables in single-user mode
> before running VACUUM.

As far as I remember, the patch checks it.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !
Hi All,

I was testing bloom indexes today. I understand bloom indexes use bloom filters. As I understand it, a bloom filter is a bit array of m bits, and a constant number "k" of hash functions is used to generate hashes for the data; the appropriate bits are then set to 1.

I was doing the following test where I was generating 10 million records and testing bloom indexes.

CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5 int, id6 int, id7 int, details text, zipcode int);

INSERT INTO foo.bar SELECT (random() * 100)::int, (random() * 100)::int, (random() * 100)::int, (random() * 100)::int, (random() * 100)::int, (random() * 100)::int, (random() * 100)::int, (random() * 100)::int, md5(g::text), floor(random()* (2- + 1) + ) from generate_series(1,100*1e6) g;

As per the documentation, bloom indexes accept 2 parameters: the *length* and the *number of bits for each column*.

Here is the problem, or the question, I have. I have created the following 2 indexes.

*Index 1*

CREATE INDEX idx_bloom_bar ON foo.bar USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode) WITH (length=48, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4, col8=4);

*Index 2*

CREATE INDEX idx_bloom_bar ON foo.bar USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode) WITH (length=48, col1=2, col2=2, col3=2, col4=2, col5=2, col6=2, col7=2, col8=2);

With a change in length, we of course see a difference in the index size, which is understandable. Here I have the same length for both indexes, but I reduced the number of bits per index column from 4 to 2. Both of the above indexes are of the same size. But there is a very slight difference in the execution time between Index 1 and Index 2, with the same cost.

So the question here is - I assume number of bits = k, where k is the total number of hash functions used on top of the data that needs to be validated. Is that correct? If yes, why do we see Index 1 performing better than Index 2?
Because the data has to go through more hash functions (4 vs 2) in Index 1 than in Index 2, with Index 1 it should take more time. Also, both indexes have ZERO false positives. Please let me know if there is anything simple that I am missing here.

*Query*

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) select * from foo.bar where id = 736833 and dept = 89861 and id2 = 573221 and id3 = 675911 and id4 = 943394 and id5 = 326756 and id6 = 597560 and zipcode = 10545;

*With Index 1*

 Bitmap Heap Scan on foo.bar  (cost=2647060.00..2647061.03 rows=1 width=69) (actual time=307.000..307.000 rows=0 loops=1)
   Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
   Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
   Buffers: shared hit=147059
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..2647060.00 rows=1 width=0) (actual time=306.997..306.997 rows=0 loops=1)
         Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
         Buffers: shared hit=147059
 Planning Time: 0.106 ms
 Execution Time: 307.030 ms
(9 rows)

*With Index 2*

 Bitmap Heap Scan on foo.bar  (cost=2647060.00..2647061.03 rows=1 width=69) (actual time=420.881..420.881 rows=0 loops=1)
   Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
   Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
   Buffers: shared hit=147059
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..2647060.00 rows=1 width=0) (actual time=420.878..420.878 rows=0 loops=1)
         Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
         Buffers: shared hit=147059
 Planning Time: 0.104 ms
 Execution Time: 420.913 ms
(9 rows)

Thanks,
Avinash.
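The k-vs-speed question above can be played with outside PostgreSQL. Below is a toy Bloom filter in Python — not contrib/bloom's actual on-disk layout; here the per-column "bits" option is taken to play roughly the role of k. With the same bit-array size and the same data, raising k from 2 to 4 costs more hashing per probe but typically yields fewer false positives, and in an index scan fewer false positives mean fewer heap pages needlessly rechecked.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # big integer used as the bit array

    def _positions(self, value):
        # Derive k bit positions from k independently salted hashes.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False = definitely absent; True = present or a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

# Same "length" (bit-array size), different k, same stored data:
false_positives = {}
for k in (2, 4):
    bf = BloomFilter(m_bits=8192, k_hashes=k)
    for v in range(1000):
        bf.add(v)
    # Probe 10,000 values that were never added.
    false_positives[k] = sum(bf.might_contain(v) for v in range(1000, 11000))
print(false_positives)  # k=4 should show noticeably fewer false positives here
```

So the extra hash evaluations are normally traded against the false-positive rate; when both indexes happen to report zero false positives for a single probe, a small per-query timing difference is hard to attribute to k alone.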
Re: Temp table handling after anti-wraparound shutdown (Was: BUG #15840)
On Fri, Jun 07, 2019 at 05:26:32PM -0700, Andres Freund wrote:
> I was more thinking that we'd move the check for orphaned-ness into a
> separate function (maybe IsOrphanedRelation()), and move the code to
> drop orphan relations into a separate function (maybe
> DropOrphanRelations()). That'd limit the amount of code duplication for
> doing this both in autovacuum and all-database vacuums quite
> considerably.

A separation makes sense. At some point we should actually try to separate vacuum and orphan relation cleanup, so separate functions make sense. The only reason why we are doing it with autovacuum is that it is the only thing in-core spawning a worker connected to a database which does a full scan of pg_class.

--
Michael
Re: be-gssapi-common.h should be located in src/include/libpq/
On Fri, Jun 07, 2019 at 09:52:26AM +0200, Daniel Gustafsson wrote:
> While looking at libpq.h I noticed what seems to be a few nitpicks: the GSS
> function prototype isn’t using the common format of having a comment
> specifying the file it belongs to; ssl_loaded_verify_locations is defined
> as extern even though it’s only available under USE_SSL (which works fine
> since it’s only accessed under USE_SSL but seems kinda wrong); and
> FeBeWaitSet is not listed under the pqcomm.c prototypes like how the extern
> vars from be-secure.c are. All of these are in the attached.

Indeed, this makes the header more consistent. Thanks for noticing.

--
Michael
Re: be-gssapi-common.h should be located in src/include/libpq/
On Fri, Jun 07, 2019 at 08:11:07AM -0400, Stephen Frost wrote:
> I'm pretty sure it ended up there just because that's how things are in
> src/interfaces/libpq. I don't have any objection to moving it, I had
> really just been waiting to see where that thread ended up going.
>
> On a quick look, the patch looks fine to me.

OK thanks. I have committed this portion of the patch for now. If there are any remaining issues let's take care of them afterwards.

--
Michael
Re: Binary support for pgoutput plugin
Hi,

On 2019-06-07 21:16:12 -0400, Chapman Flack wrote:
> On 06/07/19 21:01, Andres Freund wrote:
> > On 2019-06-07 20:52:38 -0400, Chapman Flack wrote:
> > That'd be a *lot* of additional complexity, and pretty much prohibitive
> > from a performance POV. We'd have to not continue decoding on the server
> > side *all* the time to give the client a chance to inquire additional
> > information.
>
> Does anything travel in the client->server direction during replication?
> I thought I saw CopyBoth mentioned. Is there a select()/poll() being done
> so those messages can be received?

Yes, acknowledgements of how far data has been received (and how far processed), which is then used to release resources (WAL, xid horizon) and allow synchronous replication to block until something has been received.

- Andres
Re: Binary support for pgoutput plugin
On 06/07/19 21:01, Andres Freund wrote:
> On 2019-06-07 20:52:38 -0400, Chapman Flack wrote:
>> It seems they had ended up designing a whole 'nother "protocol level"
>> involving queries wrapping their results as JSON and an app layer that
>> unwraps again, after trying a simpler first approach that was foiled by the
>> inability to see into arrays and anonymous record types in the 'describe'
>> response.
>
> I suspect quite a few people would have to have left the project before
> this would happen.

I'm not sure I understand what you're getting at. The "whole 'nother protocol" was something they actually implemented, at the application level, by rewriting their queries to produce JSON and their client to unwrap it. It wasn't proposed to go into postgres ... but it was a workaround they were forced into by the current protocol's inability to tell them what they were getting.

> That'd be a *lot* of additional complexity, and pretty much prohibitive
> from a performance POV. We'd have to not continue decoding on the server
> side *all* the time to give the client a chance to inquire additional
> information.

Does anything travel in the client->server direction during replication? I thought I saw CopyBoth mentioned. Is there a select()/poll() being done so those messages can be received?

It does seem that the replication protocol would be the tougher problem. For the regular extended-query protocol, it seems like allowing an extra Describe or two before Execute might not be awfully hard.

Regards,
-Chap
Re: Binary support for pgoutput plugin
Hi,

On 2019-06-07 20:52:38 -0400, Chapman Flack wrote:
> It seems they had ended up designing a whole 'nother "protocol level"
> involving queries wrapping their results as JSON and an app layer that
> unwraps again, after trying a simpler first approach that was foiled by the
> inability to see into arrays and anonymous record types in the 'describe'
> response.

I suspect quite a few people would have to have left the project before this would happen.

> I thought, in a new protocol rev, why not let the driver send additional
> 'describe' messages after the first one, to drill into structure of
> individual columns mentioned in the first response, before sending the
> 'execute' message?
>
> If it doesn't want the further detail, it doesn't have to ask.
>
> > And then we suddenly need tracking for all these, so we don't always
> > send out that information when we previously already did
>
> If it's up to the client driver, it can track what it needs or already has.
> I haven't looked too deeply into the replication protocol ... it happens
> under a kind of copy-both, right?, so maybe there's a way for the receiver
> to send some inquiries back, but maybe in a windowed, full-duplex way where
> it might have to buffer some incoming messages before getting the response
> to an inquiry message it sent.

That'd be a *lot* of additional complexity, and pretty much prohibitive from a performance POV. We'd have to not continue decoding on the server side *all* the time to give the client a chance to inquire additional information.

Greetings,

Andres Freund
Re: Binary support for pgoutput plugin
On 06/07/19 19:27, Andres Freund wrote:
> The problem is that I don't recognize a limiting principle:
>
> If we want NOT NULL information for clients, why don't we include the
> underlying types for arrays, and the fields in composite types? What
> about foreign keys? And unique keys?

This reminds me of an idea I had for a future fe/be protocol version, right after a talk by Alyssa Ritchie and Henrietta Dombrovskaya at the last 2Q PGConf. [1]

It seems they had ended up designing a whole 'nother "protocol level" involving queries wrapping their results as JSON and an app layer that unwraps again, after trying a simpler first approach that was foiled by the inability to see into arrays and anonymous record types in the 'describe' response.

I thought, in a new protocol rev, why not let the driver send additional 'describe' messages after the first one, to drill into structure of individual columns mentioned in the first response, before sending the 'execute' message?

If it doesn't want the further detail, it doesn't have to ask.

> And then we suddenly need tracking for all these, so we don't always
> send out that information when we previously already did

If it's up to the client driver, it can track what it needs or already has. I haven't looked too deeply into the replication protocol ... it happens under a kind of copy-both, right?, so maybe there's a way for the receiver to send some inquiries back, but maybe in a windowed, full-duplex way where it might have to buffer some incoming messages before getting the response to an inquiry message it sent.

Would those be thinkable thoughts for a future protocol rev?

Regards,
-Chap

[1] https://www.2qpgconf.com/schedule/information-exchange-techniques-for-javapostgresql-applications/
Re: Temp table handling after anti-wraparound shutdown (Was: BUG #15840)
Hi,

On 2019-06-08 08:59:37 +0900, Michael Paquier wrote:
> On Fri, Jun 07, 2019 at 03:58:43PM -0700, Andres Freund wrote:
> > Do we need to move the orphan temp cleanup code into database vacuums or
> > such?
>
> When entering into the vacuum() code path for an autovacuum, only one
> relation at a time is processed, and we have prior that extra
> processing related to toast relations when selecting the relations to
> work on, or potentially delete orphaned temp tables. For a manual
> vacuum, we finish by deciding which relation to process in
> get_all_vacuum_rels(), so the localized processing is a bit different
> than what's done in do_autovacuum() when scanning pg_class for
> relations.

Yea, I know. I didn't mean that we should handle orphan cleanup only within database-wide vacuums, just *also* there. ISTM that'd mean that at least some of the code ought to be in vacuum.c, and then also called by autovacuum.c.

> Technically, I think that it would work to give up on the gathering of
> the orphaned OIDs in a gathering and let them be gathered in the list
> of items to vacuum, and then put the deletion logic down to
> vacuum_rel().

I don't think it makes much sense to go there. The API would probably look pretty bad. I was more thinking that we'd move the check for orphaned-ness into a separate function (maybe IsOrphanedRelation()), and move the code to drop orphan relations into a separate function (maybe DropOrphanRelations()). That'd limit the amount of code duplication for doing this both in autovacuum and all-database vacuums quite considerably.

A more aggressive approach would be to teach vac_update_datfrozenxid() to ignore orphaned temp tables - perhaps even by heap_inplace'ing an orphaned table's relfrozenxid/relminmxid with InvalidTransactionId. We'd not want to do that in do_autovacuum() - otherwise the schema won't get cleaned up, but for database-wide vacuums that seems like it could be a good approach.

Random observation while re-reading this code: Having do_autovacuum() and ExecVacuum() both go through vacuum() seems like it adds too much complexity to be worth it. Like half of the file is only concerned with checks related to that.

Greetings,

Andres Freund
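The IsOrphanedRelation()/DropOrphanRelations() split floated in this thread could look roughly like the sketch below. This is a Python toy with invented names and a dict standing in for a pg_class row, meant only to show the shared-helper shape that both autovacuum and a database-wide VACUUM could call; it is not backend code.

```python
def is_orphaned_relation(rel, active_temp_namespaces):
    """A temp table whose owning backend's temp namespace is no longer
    active has been orphaned (e.g. by a crash or forced shutdown)."""
    return rel["is_temp"] and rel["temp_namespace"] not in active_temp_namespaces

def drop_orphan_relations(relations, active_temp_namespaces):
    """One shared drop pass, callable from both the autovacuum path and
    an all-database vacuum, instead of duplicating the logic in each."""
    orphans = [r for r in relations
               if is_orphaned_relation(r, active_temp_namespaces)]
    for r in orphans:
        relations.remove(r)   # stands in for actually deleting the table
    return [r["name"] for r in orphans]

# pg_temp_3 has no live session attached, so its table is orphaned.
rels = [
    {"name": "pg_temp_3.cur_semt700_progsync_4996",
     "is_temp": True, "temp_namespace": "pg_temp_3"},
    {"name": "public.measurement",
     "is_temp": False, "temp_namespace": None},
]
dropped = drop_orphan_relations(rels, active_temp_namespaces={"pg_temp_8"})
print(dropped)
```

Keeping the drop pass separate from the per-relation vacuum loop also preserves the ordering Michael describes: all orphans go away up front, before any long-running vacuum of legitimate tables begins.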
Re: Custom table AMs need to include heapam.h because of BulkInsertState
On Fri, Jun 07, 2019 at 08:55:36AM -0400, Robert Haas wrote:
> Are you talking about the call to ReleaseBulkInsertStatePin, or something
> else?

Yes, I was referring to ReleaseBulkInsertStatePin().

--
Michael
Re: Temp table handling after anti-wraparound shutdown (Was: BUG #15840)
On Fri, Jun 07, 2019 at 03:58:43PM -0700, Andres Freund wrote:
> Do we need to move the orphan temp cleanup code into database vacuums or
> such?

When entering into the vacuum() code path for an autovacuum, only one relation at a time is processed, and we have prior to that extra processing related to toast relations when selecting the relations to work on, or potentially deleting orphaned temp tables. For a manual vacuum, we finish by deciding which relations to process in get_all_vacuum_rels(), so the localized processing is a bit different from what's done in do_autovacuum() when scanning pg_class for relations.

Technically, I think that it would work to give up on gathering the orphaned OIDs separately and let them be gathered in the list of items to vacuum, and then push the deletion logic down to vacuum_rel(). However, there is a catch: for autovacuum we gather the orphaned entries and the other relations to process, then drop all the orphaned OIDs, and finally vacuum/analyze the entries collected. So if you put the deletion logic down into vacuum_rel() then we won't be able to drop orphaned tables before working on a database, which would be bad if we know about an orphaned set but autovacuum works for a long time on other legit entries first.

--
Michael
Re: Binary support for pgoutput plugin
Hi,

On 2019-06-05 19:05:05 -0400, Dave Cramer wrote:
> I am curious why you are "strongly" opposed however. We already have the
> information. Adding doesn't seem onerous.

(thought I'd already replied with this)

The problem is that I don't recognize a limiting principle: If we want NOT NULL information for clients, why don't we include the underlying types for arrays, and the fields in composite types? What about foreign keys? And unique keys?

And then we suddenly need tracking for all these, so we don't always send out that information when we previously already did - and in some of the cases there's no infrastructure for that.

I just don't quite buy that the output plugin built for pg's logical replication needs is a good place to include a continually increasing amount of metadata that logical replication doesn't need. That's going to add overhead and make the code more complicated.

Greetings,

Andres Freund
Temp table handling after anti-wraparound shutdown (Was: BUG #15840)
Hi,

(Moving a part of this discussion to hackers)

In #15840 Thierry had the situation that autovacuum apparently could not keep up, and he ended up with a wraparound situation. Following the hints and shutting down the cluster and vacuuming the relevant DB in single user mode did not resolve the issue however. That's because there was a session with temp tables:

On 2019-06-07 16:40:27 -0500, Thierry Husson wrote:
>   oid   |                 oid                  | relkind | relfrozenxid |    age
> --------+--------------------------------------+---------+--------------+------------
>  460564 | pg_temp_3.cur_semt700_progsync_4996  | r       |        36464 | 2146483652
>  460764 | pg_temp_8.cur_semt700_progsync_5568  | r       |     19836544 | 2126683572
>  460718 | pg_temp_4.cur_semt700_progsync_5564  | r       |     19836544 | 2126683572
>  460721 | pg_temp_5.cur_semt700_progsync_5565  | r       |     19836544 | 2126683572
>  461068 | pg_temp_22.cur_semt700_progsync_5581 | r       |     19836544 | 2126683572
>
> These are temporary tables to manage concurrency & server load. It seems the
> sudden disconnection due to wraparound protection didn't get them removed. I
> removed them manually under single mode and there is no more warning now,
> vacuum command included. Your command is very interesting to know.

And our logic for dropping temp tables only kicks in during autovacuum, but not in a manual database-wide VACUUM. Which means that currently the advice we give, namely to shut down and vacuum the database in single-user mode, plainly doesn't work. Without any warnings hinting in the right direction.

Do we need to move the orphan temp cleanup code into database vacuums or such?

Greetings,

Andres Freund
Re: tableam: abstracting relation sizing code
> On 7 Jun 2019, at 14:43, Robert Haas wrote:
> I think that's probably going in the wrong direction. It's arguable,
> of course. However, it seems to me that there's nothing heap-specific
> about the number 10. It's not computed based on the format of a heap
> page or a heap tuple. It's just somebody's guess (likely Tom's) about
> how to plan with empty relations. If somebody finds that another
> number works better for their AM, it's probably also better for heap,
> at least on that person's workload.

Fair enough, that makes sense.

> Good catch, and now I notice that the callback is not called
> estimate_rel_size but relation_estimate_size. Updated patch attached;
> thanks for the review.

The commit message still refers to it as estimate_rel_size though. The comment on table_block_relation_estimate_size does too, but that one might be intentional.

The v3 patch applies cleanly and passes tests (I did not see the warning that Alvaro mentioned, so yay for testing on multiple compilers). During re-review, the below part stood out as a bit odd however:

+		if (curpages < 10 &&
+			relpages == 0 &&
+			!rel->rd_rel->relhassubclass)
+			curpages = 10;
+
+		/* report estimated # pages */
+		*pages = curpages;
+		/* quick exit if rel is clearly empty */
+		if (curpages == 0)
+		{
+			*tuples = 0;
+			*allvisfrac = 0;
+			return;
+		}

While I know this codepath isn’t introduced by this patch (it was introduced in 696d78469f3), I hadn’t seen it before, so sorry for thread-jacking slightly. Maybe I’m a bit thick, but if the rel is totally empty and without children, then curpages as well as relpages would both be zero. But if so, how can we enter the second "quick exit" block, since curpages by then will have been increased to 10 in the block just before for the empty case? It seems to me that the blocks should be the other way around to really have a fast path, but I might be missing something.

cheers ./daniel
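Daniel's observation can be checked mechanically. Here is a small Python restatement of the quoted branch ordering (simplified names and return values, not the real estimation function): an empty, childless relation gets bumped to 10 pages first, so the "quick exit" test just below it can never fire for that case.

```python
def estimate_pages(curpages, relpages, relhassubclass):
    # Mirrors the quoted hunk: empty, childless rels are raised to 10
    # pages *before* the "quick exit if rel is clearly empty" check.
    if curpages < 10 and relpages == 0 and not relhassubclass:
        curpages = 10
    pages = curpages
    if curpages == 0:            # the quick-exit block
        return pages, 0          # tuples = 0
    return pages, None           # None stands in for the full estimate

# Totally empty relation without children: the quick exit is dead code
# in this ordering, since curpages was already bumped to 10 above it.
print(estimate_pages(0, 0, False))
```

The quick exit can still be reached when the relation has children (the bump condition is false then), which may be part of why the ordering survives in the real code.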
Re: heapam_index_build_range_scan's anyvisible
On 2019-Jun-07, Robert Haas wrote:
> Yeah, I wondered whether SnapshotNonVacuumable might've been added
> later, but I was too lazy to check the commit log. I'll try coding up
> that approach and see how it looks.

Thanks.

> But do you have any comment on the question of whether this function
> is actually safe with < ShareLock, per the comments about caching
> HOT-related state across buffer lock releases?

Well, as far as I understand we do hold a buffer pin on the page the whole time until we abandon it, which prevents HOT pruning, so the root offset cache should be safe (since heap_page_prune requires cleanup lock). The thing we don't keep held is a buffer lock, so I/U/D could occur, but those are not supposed to be hazards for the BRIN use, since that's covered by the anyvisible / SnapshotNonVacuumable hack^Wtechnique.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: heapam_index_build_range_scan's anyvisible
On Fri, Jun 7, 2019 at 4:30 PM Alvaro Herrera wrote:
> On 2019-Jun-07, Robert Haas wrote:
> > I spent some time today studying heapam_index_build_range_scan and
> > quickly reached the conclusion that it's kind of a mess. At heart
> > it's pretty simple: loop over all the table, check each tuple against
> > any qual, and pass the visible ones to the callback. However, in an
> > attempt to make it cater to various needs slightly outside of its
> > original design purpose, various warts have been added, and there are
> > enough of them now that I at least find it fairly difficult to
> > understand. One of those warts is anyvisible, which I gather was
> > added in support of BRIN.
>
> Yes, commit 2834855cb9fd added that flag. SnapshotNonVacuumable did not
> exist back then. It seems like maybe it would work to remove the flag
> and replace with passing SnapshotNonVacuumable. The case that caused
> that flag to be added is tested by a dedicated isolation test, so if
> BRIN becomes broken by the change at least it'd be obvious ...

Yeah, I wondered whether SnapshotNonVacuumable might've been added later, but I was too lazy to check the commit log. I'll try coding up that approach and see how it looks.

But do you have any comment on the question of whether this function is actually safe with < ShareLock, per the comments about caching HOT-related state across buffer lock releases?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: heapam_index_build_range_scan's anyvisible
On 2019-Jun-07, Robert Haas wrote:
> I spent some time today studying heapam_index_build_range_scan and
> quickly reached the conclusion that it's kind of a mess. At heart
> it's pretty simple: loop over all the table, check each tuple against
> any qual, and pass the visible ones to the callback. However, in an
> attempt to make it cater to various needs slightly outside of its
> original design purpose, various warts have been added, and there are
> enough of them now that I at least find it fairly difficult to
> understand. One of those warts is anyvisible, which I gather was
> added in support of BRIN.

Yes, commit 2834855cb9fd added that flag. SnapshotNonVacuumable did not exist back then. It seems like maybe it would work to remove the flag and replace it with passing SnapshotNonVacuumable. The case that caused that flag to be added is tested by a dedicated isolation test, so if BRIN becomes broken by the change at least it'd be obvious ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
heapam_index_build_range_scan's anyvisible
I spent some time today studying heapam_index_build_range_scan and quickly reached the conclusion that it's kind of a mess. At heart it's pretty simple: loop over all the table, check each tuple against any qual, and pass the visible ones to the callback. However, in an attempt to make it cater to various needs slightly outside of its original design purpose, various warts have been added, and there are enough of them now that I at least find it fairly difficult to understand. One of those warts is anyvisible, which I gather was added in support of BRIN. I first spent some time looking at how the 'anyvisible' flag affects the behavior of the function. AFAICS, setting the flag to true results in three behavior changes: 1. The elog(WARNING, ...) calls about a concurrent insert/delete calls in progress can't be reached. 2. In some cases, reltuples += 1 might not occur where it would've happened otherwise. 3. If we encounter a HOT-updated which was deleted by our own transaction, we index it instead of skipping it. Change #2 doesn't matter because the only caller that passes anyvisible = true seems to be BRIN, and BRIN ignores the return value. I initially thought that change #1 must not matter either, because function has comments in several places saying that the caller must hold ShareLock or better. And I thought change #3 must also not matter, because as the comments explain, this function is used to build indexes, and if our CREATE INDEX command commits, then any deletions that it has already performed will commit too, so the fact that we haven't indexed the now-deleted tuples will be fine. Then I realized that brin_summarize_new_values() is calling this function *without* ShareLock and for *not* for the purpose of creating a new index but rather for the purpose of updating an existing index, which means #1 and #3 do matter after all. But I think it's kind of confusing because anyvisible doesn't change anything about which tuples are visible. 
SnapshotAny is already making "any" tuple "visible." This flag really means "caller is holding a lower-than-normal lock level and is not inserting into a brand new relfilenode". There may be more than just a cosmetic problem here, because the comments say:

 * It might look unsafe to use this information across buffer
 * lock/unlock.  However, we hold ShareLock on the table so no
 * ordinary insert/update/delete should occur; and we hold pin on the
 * buffer continuously while visiting the page, so no pruning
 * operation can occur either.

In the BRIN case that doesn't apply; I don't know whether this is safe in that case for some other reason. I also note that amcheck's bt_check_every_level can also call this without ShareLock. It doesn't need to set anyvisible because passing a snapshot bypasses the WARNINGs anyway, but it might have whatever problem the above comment is thinking about. Also, it's just cosmetic, but this comment definitely needs updating:

/*
 * We could possibly get away with not locking the buffer here,
 * since caller should hold ShareLock on the relation, but let's
 * be conservative about it.  (This remark is still correct even
 * with HOT-pruning: our pin on the buffer prevents pruning.)
 */
LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);

One more thing. Assuming that there are no live bugs here, or that we fix them, another possible simplification would be to remove the anyvisible = true flag and have BRIN pass SnapshotNonVacuumable. SnapshotNonVacuumable returns true when HeapTupleSatisfiesVacuum doesn't return HEAPTUPLE_DEAD, so I think we'd get exactly the same behavior (again, modulo reltuples, which doesn't matter). heap_getnext() would perform functionally the same check as the bespoke code internally, and just wouldn't return the dead tuples in the first place. There's an assertion that would trip, but we could probably just change it. BRIN's callback might also get a different value for tupleIsAlive in some cases, but it ignores that value anyway. 
So to summarize: 1. Is this function really safe with < ShareLock? Both BRIN and amcheck think so, but the function itself isn't sure. If yes, we need to adapt the comments. If no, we need to think about some other fix. 2. anyvisible is a funny name given what the flag really does. Maybe we can simplify by replacing it with SnapshotNonVacuumable(). Otherwise maybe we should rename the flag. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_dump: fail to restore partition table with serial type
On 2019-May-06, Alvaro Herrera wrote: > On 2019-May-06, Rushabh Lathia wrote: > > > Found another scenario where check constraint is not getting > > dumped for the child table. > > You're right, the patched code is bogus; I'm reverting it all for > today's minors. Thanks for reporting. Here's another version of this patch. This time, I added some real tests in pg_dump's suite, including a SERIAL column and NOT NULL constraints. The improved test verifies that the partition is created separately and later attached, and it includes constraints from the parent as well as some locally defined ones. I also added tests with legacy inheritance, which was not considered previously in pg_dump tests as far as I could see. I looked for other cases that could have been broken by changing the partition creation methodology in pg_dump, and didn't find anything. That part of pg_dump (dumpTableSchema) is pretty spaghettish, though; the fact that shouldPrintColumn makes some partitioning-related decisions and then dumpTableSchema makes them again is notoriously confusing. I could have easily missed something. One weird thing about pg_dump's output of the serial column in a partitioned table is that it emits the parent table itself first without a DEFAULT clause, then the sequence and marks it as owned by the column; then it emits the partition *with* the default clause, and finally it alters the parent table's column to set the default. Now there is some method in this madness (the OWNED BY clause for the sequence is mingled together with the sequence itself), but I think this arrangement makes a partial restore of the partition fail. 
-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 72ad894d79b6f26d24a8857e5f33c6d51bd65051 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 24 Apr 2019 15:30:37 -0400 Subject: [PATCH v1 1/2] Make pg_dump emit ATTACH PARTITION instead of PARTITION OF MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Using PARTITION OF can result in column ordering being changed from the database being dumped, if the partition uses a column layout different from the parent's. It's not pg_dump's job to editorialize on table definitions, so this is not acceptable; back-patch all the way back to pg10, where partitioned tables were introduced. This change also ensures that partitions end up in the correct tablespace, if different from the parent's; this is an oversight in ca4103025dfe (in pg12 only). Partitioned indexes (in pg11) don't have this problem, because they're already created as independent indexes and attached to their parents afterwards. This change also has the advantage that the partition is restorable from the dump (as a standalone table) even if its parent table isn't restored. Author: David Rowley Reviewed-by: Álvaro Herrera Discussion: https://postgr.es/m/cakjs1f_1c260not_vbj067az3jxptxvrohdvmlebmudx1ye...@mail.gmail.com Discussion: https://postgr.es/m/20190423185007.GA27954@alvherre.pgsql --- src/bin/pg_dump/pg_dump.c| 123 +-- src/bin/pg_dump/t/002_pg_dump.pl | 12 ++- 2 files changed, 60 insertions(+), 75 deletions(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 9f59cc74ee..408eee9119 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -8631,9 +8631,12 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables) * Normally this is always true, but it's false for dropped columns, as well * as those that were inherited without any local definition. 
(If we print * such a column it will mistakenly get pg_attribute.attislocal set to true.) - * However, in binary_upgrade mode, we must print all such columns anyway and - * fix the attislocal/attisdropped state later, so as to keep control of the - * physical column order. + * For partitions, it's always true, because we want the partitions to be + * created independently and ATTACH PARTITION used afterwards. + * + * In binary_upgrade mode, we must print all columns and fix the attislocal/ + * attisdropped state later, so as to keep control of the physical column + * order. * * This function exists because there are scattered nonobvious places that * must be kept in sync with this decision. @@ -8643,7 +8646,9 @@ shouldPrintColumn(DumpOptions *dopt, TableInfo *tbinfo, int colno) { if (dopt->binary_upgrade) return true; - return (tbinfo->attislocal[colno] && !tbinfo->attisdropped[colno]); + if (tbinfo->attisdropped[colno]) + return false; + return (tbinfo->attislocal[colno] || tbinfo->ispartition); } @@ -15594,27 +15599,6 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo) if (tbinfo->reloftype && !dopt->binary_upgrade) appendPQExpBuffer(q, " OF %s", tbinfo->reloftype); - /* - * If the table is a partition, dump it as such; except in the case of - * a binary upgrade, w
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Fri, Dec 28, 2018 at 11:32 AM Masahiko Sawada wrote: > On Thu, Dec 27, 2018 at 10:24 PM Alvaro Herrera > wrote: > > > > On 2018-Dec-27, Alexey Kondratov wrote: > > > > > To summarize: > > > > > > 1) Alvaro and Michael agreed, that REINDEX with tablespace move may be > > > useful. This is done in the patch attached to my initial email. Adding > > > REINDEX to ALTER TABLE as new action seems quite questionable for me and > > > not > > > completely semantically correct. ALTER already looks bulky. > > > > Agreed on these points. > > As an alternative idea, I think we can have a new ALTER INDEX variants > that rebuilds the index while moving tablespace, something like ALTER > INDEX ... REBUILD SET TABLESPACE +1 It seems the easiest way to have feature-full commands. If we put functionality of CLUSTER and VACUUM FULL to ALTER TABLE, and put functionality of REINDEX to ALTER INDEX, then CLUSTER, VACUUM FULL and REINDEX would be just syntax sugar. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [PROPOSAL] Drop orphan temp tables in single-mode
On Fri, Mar 8, 2019 at 9:28 AM Michael Paquier wrote: > On Thu, Mar 07, 2019 at 10:49:29AM -0500, Robert Haas wrote: > > On Thu, Mar 7, 2019 at 10:24 AM Tom Lane wrote: > > > So if we think we can invent a "MAGICALLY FIX MY DATABASE" command, > > > let's do that. But please let's not turn a well defined command > > > like VACUUM into something that you don't quite know what it will do. > > > > I am on the fence about that. I see your point, but on the other > > hand, autovacuum drops temp tables all the time in multi-user mode and > > I think it's pretty clear that, with the possible exception of you, > > users find that an improvement. So it could be argued that we're > > merely proposing to make the single-user mode behavior of vacuum > > consistent with the behavior people are already expecting it to do. > > It is possible for a session to drop temporary tables of other > sessions. Wouldn't it work as well in this case for single-user mode > when seeing an orphan temp table still defined? Like Tom, I don't > think that it is a good idea to play with the heuristics of VACUUM in > the way the patch proposes. I think we have a kind of agreement that having a simple way to get rid of all orphan tables in single-user mode is good. The question is whether it's good for the VACUUM command to do this. Naturally, a user may enter single-user mode for different reasons, not only for fixing xid wraparound. For example, one may enter this mode to examine temporary tables present in the database. Then it would be a disappointing surprise that all of them are gone after a VACUUM command. So, what about a special option which would make VACUUM drop orphan tables in single-user mode? Do we need it in multi-user mode too? BTW, does this patch check that a temporary table is really orphaned? AFAICS, a user may define some temporary tables in single-user mode before running VACUUM. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: tableam: inconsistent parameter name
On Fri, Jun 7, 2019 at 1:19 PM Andres Freund wrote: > On 2019-06-07 13:11:21 -0400, Robert Haas wrote: > > I found what appears to be another typo very nearby. Attached please > > find v2, fixing both issues. > > Hm, I think that's fixed already? Oops, you're right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: tableam: inconsistent parameter name
On 2019-06-07 13:11:21 -0400, Robert Haas wrote: > I found what appears to be another typo very nearby. Attached please > find v2, fixing both issues. Hm, I think that's fixed already?
Re: tableam: inconsistent parameter name
On Fri, Jun 7, 2019 at 12:52 PM Andres Freund wrote: > On 2019-06-07 12:37:33 -0400, Robert Haas wrote: > > TableAmRoutine's index_build_range_scan thinks that parameter #8 is > > called end_blockno, but table_index_build_range_scan and > > heapam_index_build_range_scan and BRIN's summarize_range all agree > > that it's the number of blocks to be scanned. I assume that this got > > changed at some point while Andres was hacking on it and this one > > place just never got updated. > > Not sure where it came from :/ > > > Proposed patch attached. Since this seems like a bug, albeit a > > harmless one, I propose to commit this to v12. > > Yea, please do! I found what appears to be another typo very nearby. Attached please find v2, fixing both issues. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company v2-0001-tableam-Fix-cosmetic-mistakes-related-to-index_bu.patch Description: Binary data
Re: tableam: inconsistent parameter name
Hi, On 2019-06-07 12:37:33 -0400, Robert Haas wrote: > TableAmRoutine's index_build_range_scan thinks that parameter #8 is > called end_blockno, but table_index_build_range_scan and > heapam_index_build_range_scan and BRIN's summarize_range all agree > that it's the number of blocks to be scanned. I assume that this got > changed at some point while Andres was hacking on it and this one > place just never got updated. Not sure where it came from :/ > Proposed patch attached. Since this seems like a bug, albeit a > harmless one, I propose to commit this to v12. Yea, please do! Greetings, Andres Freund
Re: Custom table AMs need to include heapam.h because of BulkInsertState
Hi, (David, see bottom if you're otherwise not interested). On 2019-06-07 09:48:29 -0400, Robert Haas wrote: > On Sat, Jun 1, 2019 at 3:20 PM Andres Freund wrote: > > > I'd like to think that the best way to deal with that and reduce the > > > confusion would be to move anything related to bulk inserts into their > > > own header/file, meaning the following set: > > > - ReleaseBulkInsertStatePin > > > - GetBulkInsertState > > > - FreeBulkInsertState > > > There is the argument that we could also move that part into tableam.h > > > itself though as some of the rather generic table-related callbacks, > > > but that seems grotty. So I think that we could just move that stuff > > > as backend/access/common/bulkinsert.c. > > > > Yea, I think we should do that at some point. But I'm not sure this is > > the right design. Bulk insert probably needs to rather be something > > that's allocated inside the AM. > > As far as I can see, any on-disk, row-oriented, block-based AM is > likely to want the same implementation as the heap. I'm pretty doubtful about that. It'd, e.g., make a ton of sense to keep the VM pinned, even for heap. You could also do a lot better with toast. And for zheap we'd - unless we change the design - quite possibly benefit from keeping the last needed tpd buffer around. > Here's a draft design for adding some abstraction, roughly modeled on > the abstraction Andres added for TupleTableSlots: Hm, I'm not sure I see the need for a vtable based approach here. Won't every AM know exactly what they need / have? I'm not convinced it's worthwhile to treat that separately from the tableam. I.e. have a BulkInsertState struct with *no* members, and then, as you suggest: > > 3. 
table AM gets a new member BulkInsertState > *(*create_bistate)(Relation Rel) and a corresponding function > table_create_bistate(), analogous to table_create_slot(), which can > call the constructor function for the appropriate type of > BulkInsertState and return the result but also route the following through the AM: > 2. that structure has a member for each defined operation on a > BulkInsertState: > > void (*free)(BulkInsertState *); > void (*release_pin)(BulkInsertState *); // maybe rename to make it more > generic Where free would just be part of finish_bulk_insert, and release_pin a new callback. > 4. each type of BulkInsertState has its own functions for making use > of it, akin to ReadBufferBI. Right, I don't think that's avoidable unfortunately. > I see Michael's point about the relationship between > finish_bulk_insert() and the BulkInsertState, and maybe if we could > figure that out we could avoid the need for a BulkInsertState to have > a free method (or maybe any methods at all, in which case it could > just be an opaque struct, like a Node). Right, so we actually ended up at the same place. And you found a bug: > However, it looks to me as though copy.c can create a bunch of > BulkInsertStates but only call finish_bulk_insert() once, so unless > that's a bug in need of fixing I don't quite see how to make that > approach work. That is a bug. Not a currently "active" one with in-core AMs (no dangerous bulk insert flags ever get set for partitioned tables), but we obviously need to fix it nevertheless. Robert, it seems we'll - regardless of where we come down on fixing this bug - have to make copy use multiple BulkInsertStates, even in the CIM_SINGLE (with proute) case. Or do you have a better idea? David, any opinions on how to best fix this? It's not extremely obvious how to do so best in the current setup of the partition actually being hidden somewhere a few calls away (i.e. the table_open happening in ExecInitPartitionInfo()). 
Greetings, Andres Freund
tableam: inconsistent parameter name
TableAmRoutine's index_build_range_scan thinks that parameter #8 is called end_blockno, but table_index_build_range_scan and heapam_index_build_range_scan and BRIN's summarize_range all agree that it's the number of blocks to be scanned. I assume that this got changed at some point while Andres was hacking on it and this one place just never got updated. Proposed patch attached. Since this seems like a bug, albeit a harmless one, I propose to commit this to v12. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company 0001-tableam-Fix-index_build_range_scan-parameter-name.patch Description: Binary data
Re: tableam: abstracting relation sizing code
On 2019-Jun-07, Robert Haas wrote: > On Fri, Jun 7, 2019 at 8:43 AM Robert Haas wrote: > > Good catch, and now I notice that the callback is not called > > estimate_rel_size but relation_estimate_size. Updated patch attached; > > thanks for the review. > > Let's try that one more time, and this time perhaps I'll make it compile. It looks good to me, passes tests. This version seems to introduce a warning in my build: /pgsql/source/master/src/backend/access/table/tableam.c: In function 'table_block_relation_estimate_size': /pgsql/source/master/src/backend/access/table/tableam.c:633:12: warning: implicit declaration of function 'rint' [-Wimplicit-function-declaration] *tuples = rint(density * (double) curpages); ^~~~ /pgsql/source/master/src/backend/access/table/tableam.c:633:12: warning: incompatible implicit declaration of built-in function 'rint' /pgsql/source/master/src/backend/access/table/tableam.c:633:12: note: include '<math.h>' or provide a declaration of 'rint' -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: tableam: abstracting relation sizing code
Hi, On 2019-06-07 08:32:37 -0400, Robert Haas wrote: > On Thu, Jun 6, 2019 at 10:08 PM Michael Paquier wrote: > > "allvisfrac", "pages" and "tuples" had better be documented about > > which result they represent. > > A lot of the table AM stuff (and the related slot stuff) lacks > function header comments; I don't like that and think it should be > improved. However, that's not the job of this patch. I think it's > completely correct for this patch to document, as it does, that the > arguments have the same meaning as for the estimate_rel_size method, > and leave it at that. There is certainly negative value in duplicating > the definitions in multiple places, where they might get out of sync. > The header comment for table_relation_estimate_size() refers the > reader to the comments for estimate_rel_size(), which says: Note that these functions ended up that way by precisely this logic... ;) Greetings, Andres Freund
Re: tableam: abstracting relation sizing code
On Fri, Jun 7, 2019 at 8:43 AM Robert Haas wrote: > Good catch, and now I notice that the callback is not called > estimate_rel_size but relation_estimate_size. Updated patch attached; > thanks for the review. Let's try that one more time, and this time perhaps I'll make it compile. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company v3-0001-tableam-Provide-helper-functions-for-relation-siz.patch Description: Binary data
Re: doc: pg_trgm missing description for GUC "pg_trgm.strict_word_similarity_threshold"
On 6/7/19 9:00 PM, Michael Paquier wrote: On Fri, Jun 07, 2019 at 03:44:14PM +0900, Masahiko Sawada wrote: BTW, while looking at the GUC variables defined in trgm_op.c, the operators in each short description seem not correct; there is an extra percent sign. Should we also fix them? Both of you are right here. I did notice the double percent signs, but my brain skipped over them, assuming they were translatable strings; thanks for catching that. And the additional documentation looks fine to me (except the indentation). The indentation in the additional documentation seems fine to me; it's the section for the preceding GUC which is offset one column to the right. Patch attached for that. > The fix for the parameter descriptions can be back-patched safely as they > would reload correctly once the version is updated. Yup, they would appear the first time one of the pg_trgm functions is called in a session after the new object file is installed. > Or is that not worth bothering except on HEAD? Thoughts? Personally I don't think it's that critical, but I'm not bothered either way. Presumably no-one has complained so far anyway (I only chanced upon the missing GUC description because I was poking about looking for examples of custom GUC handling...) Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml index 83b0033d7a..e353ecc954 100644 --- a/doc/src/sgml/pgtrgm.sgml +++ b/doc/src/sgml/pgtrgm.sgml @@ -320,23 +320,23 @@ - - - pg_trgm.word_similarity_threshold (real) - - -pg_trgm.word_similarity_threshold configuration parameter - - - - - - Sets the current word similarity threshold that is used by - <% and %> operators. The threshold - must be between 0 and 1 (default is 0.6). 
- - - + + + pg_trgm.word_similarity_threshold (real) + + + pg_trgm.word_similarity_threshold configuration parameter + + + + + + Sets the current word similarity threshold that is used by + <% and %> operators. The threshold + must be between 0 and 1 (default is 0.6). + + +
Re: Avoiding hash join batch explosions with extreme skew and weird stats
On Fri, Jun 7, 2019 at 10:17 AM Tomas Vondra wrote: > Yes, they could get quite big, and I think you're right we need to > keep that in mind, because it's on the outer (often quite large) side of > the join. And if we're aiming to restrict memory usage, it'd be weird to > just ignore this. > > But I think Thomas Munro originally proposed to treat this as a separate > BufFile, so my assumption was each worker would simply rewrite the bitmap > repeatedly for each hash table fragment. That means a bit more I/O, but > those files are buffered and written in 8kB pages, with just 1 bit per > tuple. I think that's pretty OK and way cheaper than rewriting the whole > batch, where each tuple can be hundreds of bytes. Yes, this is also my thought. I'm not 100% sure I understand Melanie's proposal, but I think that it involves writing every still-unmatched outer tuple for every inner batch. This proposal -- assuming we can get the tuple numbering worked out -- involves writing a bit for every outer tuple for every inner batch. So each time you do an inner batch, you write either (a) one bit for EVERY outer tuple or (b) the entirety of each unmatched tuple. It's possible for the latter to be cheaper if the number of unmatched tuples is really, really tiny, but it's not very likely. For example, suppose that you've got 4 batches and each batch matches 99% of the tuples, which are each 50 bytes wide. After each batch, approach A writes 1 bit per tuple, so a total of 4 bits per tuple after 4 batches. Approach B writes a different amount of data after each batch. After the first batch, it writes 1% of the tuples, and for each one written it writes 50 bytes, so it writes 50 bytes * 0.01 = ~4 bits/tuple. 
That's already equal to what approach A wrote after all 4 batches, and it's going to do a little more I/O over the course of the remaining batches - although not much, because the unmatched tuples file will be very very tiny after we eliminate 99% of the 1% that survived the first batch. However, these are extremely favorable assumptions for approach B. If the tuples are wider or the batches match only say 20% of the tuples, approach B is going to be way more I/O. Assuming I understand correctly, which I may not. > Also, it does not require any concurrency control, which rewriting the > batches themselves probably does (because we'd be feeding the tuples into > some shared file, I suppose). Except for the final step when we need to > merge the bitmaps, of course. I suppose that rewriting the batches -- or really the unmatched tuples file -- could just use a SharedTuplestore, so we probably wouldn't need a lot of new code for this. I don't know whether contention would be a problem or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Avoiding hash join batch explosions with extreme skew and weird stats
On Thu, Jun 6, 2019 at 7:31 PM Melanie Plageman wrote: > I'm not sure I understand why you would need to compare the original > tuples to the unmatched tuples file. I think I was confused. Actually, I'm still not sure I understand this part: > Then, you iterate again through the outer side a third time to join it > to the unmatched tuples in the unmatched tuples file (from the first > chunk) and write the following to a new unmatched tuples file: > 5 > 9 > 11 > 11 and likewise here > Then you iterate a fifth time through the outer side to join it to the > unmatched tuples in the unmatched tuples file (from the second chunk) > and write the following to a new unmatched tuples file: > 11 > 11 So you refer to joining the outer side to the unmatched tuples file, but how would that tell you which outer tuples had no matches on the inner side? I think what you'd need to do is anti-join the unmatched tuples file to the current inner batch. So the algorithm would be something like:

for each inner batch:
    for each outer tuple:
        if tuple matches inner batch then emit match
        if tuple does not match inner batch and this is the first inner batch:
            write tuple to unmatched tuples file
    if this is not the first inner batch:
        for each tuple from the unmatched tuples file:
            if tuple does not match inner batch:
                write to new unmatched tuples file
        discard previous unmatched tuples file and use the new one for the next iteration

for each tuple in the final unmatched tuples file:
    null-extend and emit

If that's not what you have in mind, maybe you could provide some similar pseudocode? Or you can just ignore me. I'm not trying to interfere with an otherwise-fruitful discussion by being the only one in the room who is confused... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Avoiding hash join batch explosions with extreme skew and weird stats
On Thu, Jun 06, 2019 at 04:33:31PM -0700, Melanie Plageman wrote: On Tue, Jun 4, 2019 at 12:15 PM Robert Haas wrote: On Tue, Jun 4, 2019 at 3:09 PM Melanie Plageman wrote: > One question I have is, how would the OR'd together bitmap be > propagated to workers after the first chunk? That is, when there are > no tuples left in the outer bunch, for a given inner chunk, would you > load the bitmaps from each worker into memory, OR them together, and > then write the updated bitmap back out so that each worker starts with > the updated bitmap? I was assuming we'd elect one participant to go read all the bitmaps, OR them together, and generate all the required null-extended tuples, sort of like the PHJ_BUILD_ALLOCATING, PHJ_GROW_BATCHES_ALLOCATING, PHJ_GROW_BUCKETS_ALLOCATING, and/or PHJ_BATCH_ALLOCATING states only involve one participant being active at a time. Now you could hope for something better -- why not parallelize that work? But on the other hand, why not start simple and worry about that in some future patch instead of right away? A committed patch that does something good is better than an uncommitted patch that does something AWESOME. What if you have a lot of tuples -- couldn't the bitmaps get pretty big? And then you have to OR them all together and if you can't put the whole bitmap from each worker into memory at once to do it, it seems like it would be pretty slow. (I mean maybe not as slow as reading the outer side 5 times when you only have 3 chunks on the inner + all the extra writes from my unmatched tuple file idea, but still...) Yes, they could get quite big, and I think you're right we need to keep that in mind, because it's on the outer (often quite large) side of the join. And if we're aiming to restrict memory usage, it'd be weird to just ignore this. But I think Thomas Munro originally proposed to treat this as a separate BufFile, so my assumption was each worker would simply rewrite the bitmap repeatedly for each hash table fragment. 
That means a bit more I/O, but those files are buffered and written in 8kB pages, with just 1 bit per tuple. I think that's pretty OK and way cheaper than rewriting the whole batch, where each tuple can be hundreds of bytes. Also, it does not require any concurrency control, which rewriting the batches themselves probably does (because we'd be feeding the tuples into some shared file, I suppose). Except for the final step when we need to merge the bitmaps, of course. So I think this would work, it does not have the issue with using too much memory, and I don't think the overhead is too bad. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Avoiding hash join batch explosions with extreme skew and weird stats
On Thu, Jun 06, 2019 at 04:37:19PM -0700, Melanie Plageman wrote: On Thu, May 16, 2019 at 3:22 PM Thomas Munro wrote: Admittedly I don't have a patch, just a bunch of handwaving. One reason I haven't attempted to write it is because although I know how to do the non-parallel version using a BufFile full of match bits in sync with the tuples for outer joins, I haven't figured out how to do it for parallel-aware hash join, because then each loop over the outer batch could see different tuples in each participant. You could use the match bit in HashJoinTuple header, but then you'd have to write all the tuples out again, which is more IO than I want to do. I'll probably start another thread about that. Going back to the idea of using the match bit in the HashJoinTuple header and writing out all of the outer side for every chunk of the inner side, I was wondering if there was something we could do that was kind of like mmap'ing the outer side file to give the workers in parallel hashjoin the ability to update a match bit in the tuple in place and avoid writing the whole outer side out each time. I think this was one of the things we discussed in Ottawa - we could pass the index of the tuple (in the batch) along with the tuple, so that each worker knows which bit to set. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Custom table AMs need to include heapam.h because of BulkInsertState
On Sat, Jun 1, 2019 at 3:20 PM Andres Freund wrote: > > I'd like to think that the best way to deal with that and reduce the > > confusion would be to move anything related to bulk inserts into their > > own header/file, meaning the following set: > > - ReleaseBulkInsertStatePin > > - GetBulkInsertState > > - FreeBulkInsertState > > There is the argument that we could also move that part into tableam.h > > itself though as some of the rather generic table-related callbacks, > > but that seems grotty. So I think that we could just move that stuff > > as backend/access/common/bulkinsert.c. > > Yea, I think we should do that at some point. But I'm not sure this is > the right design. Bulk insert probably needs to rather be something > that's allocated inside the AM. As far as I can see, any on-disk, row-oriented, block-based AM is likely to want the same implementation as the heap. Column stores might want to pin multiple buffers, and an in-memory AM might have a completely different set of requirements, but something like zheap really has no reason to depart from what the heap does. I think it's really important that new table AMs not only have the option to do something different than the heap in any particular area, but that they also have the option to do the SAME thing as the heap without having to duplicate a bunch of code. So I think it would be reasonable to start by doing some pure code movement here, along the lines proposed by Michael -- not sure if src/backend/access/common is right or if it should be src/backend/access/table -- and then add the abstraction afterwards. Worth noting is that ReadBufferBI() also needs moving and is actually a bigger problem than the functions that Michael mentioned, because the other functions are accessible if you're willing to stoop to including heap-specific headers, but that function is static and you'll have to just copy-and-paste it. Uggh. 
Here's a draft design for adding some abstraction, roughly modeled on the abstraction Andres added for TupleTableSlots: 1. a BulkInsertState becomes a struct whose only member is a pointer to const BulkInsertStateOps *const ops 2. that structure has a member for each defined operation on a BulkInsertState: void (*free)(BulkInsertState *); void (*release_pin)(BulkInsertState *); // maybe rename to make it more generic 3. table AM gets a new member BulkInsertState *(*create_bistate)(Relation Rel) and a corresponding function table_create_bistate(), analogous to table_create_slot(), which can call the constructor function for the appropriate type of BulkInsertState and return the result 4. each type of BulkInsertState has its own functions for making use of it, akin to ReadBufferBI. That particular function signature is only likely to be correct for something that does more-or-less what the existing type of BulkInsertState does; if you're using a column-store that pins multiple buffers or something, you'll need your own code path. But that's OK, because ReadBufferBI or whatever other functions you have are only going to get called from AM-specific code, which will know what type of BulkInsertState they have got, because they are in control of which kind of BulkInsertState gets created for their relations as per point #4, so they can just call the right functions. 5. The current implementation of BulkInsertState gets renamed to BlockBulkInsertState (or something else) and is used by heap and any AMs that like it. I see Michael's point about the relationship between finish_bulk_insert() and the BulkInsertState, and maybe if we could figure that out we could avoid the need for a BulkInsertState to have a free method (or maybe any methods at all, in which case it could just be an opaque struct, like a Node). 
However, it looks to me as though copy.c can create a bunch of BulkInsertStates but only call finish_bulk_insert() once, so unless that's a bug in need of fixing I don't quite see how to make that approach work. Comments? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Custom table AMs need to include heapam.h because of BulkInsertState
On Thu, Jun 6, 2019 at 10:29 PM Michael Paquier wrote: > One thing which is a bit tricky is that for example with COPY FROM we > have a code path which is able to release a buffer held by the bulk > insert state. Are you talking about the call to ReleaseBulkInsertStatePin, or something else? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: hyrax vs. RelationBuildPartitionDesc
On Thu, Jun 6, 2019 at 2:48 AM Amit Langote wrote: > Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch, > which seems to fix this issue for me. Yeah, that looks right. I think my patch was full of fuzzy thinking and inadequate testing; thanks for checking it over and coming up with the right solution. Anyone else want to look/comment? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: tableam: abstracting relation sizing code
On Fri, Jun 7, 2019 at 4:11 AM Daniel Gustafsson wrote: > Makes sense. Regarding one of the hacks: > > +* HACK: if the relation has never yet been vacuumed, use a minimum > size > +* estimate of 10 pages. The idea here is to avoid assuming a > +* newly-created table is really small, even if it currently is, > because > +* that may not be true once some data gets loaded into it. > > When this is a generic function for AM’s, would it make sense to make the > minimum estimate a passed in value rather than hardcoded at 10? I don’t have > a > case in mind, but I also don’t think that assumptions made for heap > necessarily > makes sense for all AM’s. Just thinking out loud. I think that's probably going in the wrong direction. It's arguable, of course. However, it seems to me that there's nothing heap-specific about the number 10. It's not computed based on the format of a heap page or a heap tuple. It's just somebody's guess (likely Tom's) about how to plan with empty relations. If somebody finds that another number works better for their AM, it's probably also better for heap, at least on that person's workload. It seems more likely to me that this needs to be a GUC or reloption than that it needs to be an AM-specific property. It also seems to me that if the parameters of the hack randomly vary from one AM to another, it's likely to create confusing plan differences that have nothing to do with the actual merits of what the AMs are doing. But all that being said, I'm not blocking anybody from fooling around with this; nobody's obliged to use the helper function. It's just there for people who want the same AM-independent logic that the heap uses. > > Here is a patch that tries to improve the situation. I don't know > > whether there is some better approach; this seemed like the obvious > > thing to do. > > A small nitpick on the patch: > > + * estimate_rel_size callback, because it has a few additional paramters. 
> > s/paramters/parameters/ Good catch, and now I notice that the callback is not called estimate_rel_size but relation_estimate_size. Updated patch attached; thanks for the review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company v2-0001-tableam-Provide-helper-functions-for-relation-siz.patch Description: Binary data
Re: tableam: abstracting relation sizing code
On Thu, Jun 6, 2019 at 10:08 PM Michael Paquier wrote: > Looks like a neat split. Thanks. > "allvisfrac", "pages" and "tuples" had better be documented about > which result they represent. A lot of the table AM stuff (and the related slot stuff) lacks function header comments; I don't like that and think it should be improved. However, that's not the job of this patch. I think it's completely correct for this patch to document, as it does, that the arguments have the same meaning as for the estimate_rel_size method, and leave it at that. There is certainly negative value in duplicating the definitions in multiple places, where they might get out of sync. The header comment for table_relation_estimate_size() refers the reader to the comments for estimate_rel_size(), which says: * estimate_rel_size - estimate # pages and # tuples in a table or index * * We also estimate the fraction of the pages that are marked all-visible in * the visibility map, for use in estimation of index-only scans. * * If attr_widths isn't NULL, it points to the zero-index entry of the * relation's attr_widths[] cache; we fill this in if we have need to compute * the attribute widths for estimation purposes. ...which AFAICT constitutes as much documentation of these parameters as we have got. I think this is all a bit byzantine and could probably be made clearer, but (1) fortunately it's not all that hard to guess what these are supposed to mean and (2) I don't feel obliged to do semi-related comment cleanup every time I patch tableam. > Could you explain what's the use cases you have in mind for > usable_bytes_per_page? All AMs using smgr need to have a page header, > which implies that the usable number of bytes is (BLCKSZ - page > header) like heap for tuple data. In the AMs you got to work with, do > you store some extra data in the page which is not used for tuple > storage? I think that makes sense, just wondering about the use > case. Exactly. 
BLCKSZ - page header is probably the maximum unless you roll your own page format, but you could easily have less if you use the special space. zheap is storing transaction slots there; you could store an epoch to avoid freezing, and there's probably quite a few other reasonable possibilities. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Fix inconsistencies for v12
Hello Amit, 07.06.2019 7:30, Amit Kapila wrote: > Leaving the changes related to the above two points, I have combined > all the changes and fixed the things as per my review in the attached > patch. Alexander, see if you can verify once the combined patch. I > am planning to commit the attached by tomorrow and then we can deal > with the remaining two. However, in the meantime, if Andres shared > his views on the above two points, then we can include the changes > corresponding to them as well. Amit, I agree with all of your changes. All I could suggest is to move a dot: .. (see contrib/postgres_fdw/postgres_fdw.c: postgresBeginDirectModify() as one example). Best regards, Alexander
Re: be-gssapi-common.h should be located in src/include/libpq/
Greetings, * Michael Paquier (mich...@paquier.xyz) wrote: > As mentioned on another thread about test coverage, I have noticed > that be-gssapi-common.h is not placed at the correct location, even > if its identification path at the top points to where the file should > be: https://www.postgresql.org/message-id/20190604014630.gh1...@paquier.xyz > > The file has been introduced at its current location as of b0b39f72. > Any objections to something like the attached? I'm pretty sure it ended up there just because that's how things are in src/interfaces/libpq. I don't have any objection to moving it, I had really just been waiting to see where that thread ended up going. On a quick look, the patch looks fine to me. Thanks, Stephen signature.asc Description: PGP signature
Re: doc: pg_trgm missing description for GUC "pg_trgm.strict_word_similarity_threshold"
On Fri, Jun 07, 2019 at 03:44:14PM +0900, Masahiko Sawada wrote: > BTW while looking at the GUC variables defined in trgm_op.c, the operators in > each short description seem not correct; there is an extra percent > sign. Should we also fix them? Both of you are right here, and the additional documentation looks fine to me (except the indentation). The fix for the parameter descriptions can be back-patched safely as they would reload correctly once the version is updated. Or is that not worth bothering except on HEAD? Thoughts? -- Michael signature.asc Description: PGP signature
Re: Small review comment on pg_checksums
On Fri, Jun 07, 2019 at 03:30:35PM +0900, Masahiko Sawada wrote: > Agreed. Please find an attached patch. Thanks, committed. -- Michael signature.asc Description: PGP signature
Re: PG 12 draft release notes
On 5/11/19 10:33 PM, Bruce Momjian wrote: > I have posted a draft copy of the PG 12 release notes here: > > http://momjian.us/pgsql_docs/release-12.html > > They are committed to git. It includes links to the main docs, where > appropriate. Our official developer docs will rebuild in a few hours. > Hello, Looking at a user request to add Greek to our FTS [1], I suggest mentioning which languages have been added in fd582317e. Patch attached. I'm hesitant about whether to also mention these changes: > These all work in UTF8, and the indonesian and irish ones also work in LATIN1. > The non-UTF8 version of the hungarian stemmer now works in LATIN2 not LATIN1. 1: https://www.postgresql.org/message-id/trinity-f09793a1-8c13-4b56-94fe-10779e96c87e-1559896268438%403c-app-mailcom-bs16 Cheers, -- Adrien diff --git a/doc/src/sgml/release-12.sgml b/doc/src/sgml/release-12.sgml index 532055456c..c50a23bd1c 100644 --- a/doc/src/sgml/release-12.sgml +++ b/doc/src/sgml/release-12.sgml @@ -2085,8 +2085,8 @@ Author: Tom Lane --> - Update Snowball stemmer dictionaries with support for new languages - (Arthur Zakirov) + Update Snowball stemmer dictionaries with support for new languages: + arabic, indonesian, irish, lithuanian, nepali and tamil (Arthur Zakirov) signature.asc Description: OpenPGP digital signature
Missing generated column in ALTER TABLE ADD COLUMN doc
Hi, We support ALTER TABLE ADD COLUMN .. GENERATED ALWAYS AS .. but the doc is missing it. Attached small patch fixes this. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center doc_alter_column_add_generated_column.patch Description: Binary data
Re: tableam: abstracting relation sizing code
> On 6 Jun 2019, at 22:40, Robert Haas wrote: > It looks to me as though any table AM that uses the relation forks > supported by PostgreSQL in a more or less normal manner is likely to > require an implementation of the relation_size callback that is > identical to the one for heap, and an implementation of the > estimate_rel_size method that is extremely similar to the one for > heap. The latter is especially troubling as the amount of code > duplication is non-trivial, and it's full of special hacks. Makes sense. Regarding one of the hacks: +* HACK: if the relation has never yet been vacuumed, use a minimum size +* estimate of 10 pages. The idea here is to avoid assuming a +* newly-created table is really small, even if it currently is, because +* that may not be true once some data gets loaded into it. When this is a generic function for AM’s, would it make sense to make the minimum estimate a passed in value rather than hardcoded at 10? I don’t have a case in mind, but I also don’t think that assumptions made for heap necessarily makes sense for all AM’s. Just thinking out loud. > Here is a patch that tries to improve the situation. I don't know > whether there is some better approach; this seemed like the obvious > thing to do. A small nitpick on the patch: + * estimate_rel_size callback, because it has a few additional paramters. s/paramters/parameters/ cheers ./daniel
Wording variations in descriptions of gucs.
Hello. In guc.c many of the variables are described as "Sets something" as if the variable name is the omitted subject. A few seem to be wrongly written as "Set something" with the same intention. Is it useful to unify them with the majority? wal_level > gettext_noop("Set the level of information written to the WAL."), log_transaction_sample_rate > gettext_noop("Set the fraction of transactions to log for new transactions."), Though, recovery_target seems written as intended. > gettext_noop("Set to 'immediate' to end recovery as soon as a consistent > state is reached. # rather it seems to be the detailed description.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center guc_desc_fix.patch Description: Binary data
Re: be-gssapi-common.h should be located in src/include/libpq/
> On 7 Jun 2019, at 06:34, Michael Paquier wrote: > Any objections to something like the attached? No objections to moving the file per the patch. While looking at libpq.h I noticed what seems to be a few nitpicks: the GSS function prototype isn’t using the common format of having a comment specifying the file it belongs to; ssl_loaded_verify_locations is defined as extern even though it’s only available under USE_SSL (which works fine since it’s only accessed under USE_SSL but seems kinda wrong); and FeBeWaitSet is not listed under the pqcomm.c prototypes like how the extern vars from be-secure.c are. All of these are in the attached. cheers ./daniel libpq_reorg.diff Description: Binary data
Re: Should we warn against using too many partitions?
Hi, Thanks for the updated patches. On Fri, Jun 7, 2019 at 2:34 PM David Rowley wrote: > Anyway comments welcome. If I had a few more minutes to spare I'd > have wrapped OLTP in tags, but out of time for now. Some rewording suggestions. 1. +...Removal of unwanted data is also a factor to consider when +planning your partitioning strategy as an entire partition can be removed +fairly quickly. However, if data that you want to keep exists in that +partition then that means having to resort to using +DELETE instead of removing the partition. Not sure if the 2nd sentence is necessary, or perhaps it should be rewritten in a way that helps readers design their partitioning to benefit from this. Maybe: ...Removal of unwanted data is also a factor to consider when planning your partitioning strategy as an entire partition can be removed fairly quickly, especially if the partition keys are chosen such that all data that can be deleted together are grouped into separate partitions. 2. +... For example, if you choose to have one partition +per customer and you currently have a small number of large customers, +what will the implications be if in several years you obtain a large +number of small customers. The sentence could be rewritten a bit. Maybe as: ... For example, choosing a design with one partition per customer, because you currently have a small number of large customers, will not scale well several years down the line when you might have a large number of small customers. Btw, doesn't it suffice here to say "large number of customers" instead of "large number of small customers"? 3. +... In this case, it may be better to choose to +partition by RANGE and choose a reasonable number of +partitions Maybe: ... and choose a reasonable number of partitions, each containing the data of a fixed number of customers. 4. +... It also +may be undesirable to have a large number of partitions as each partition +requires metadata about the partition to be stored in each session that +touches it.
If each session touches a large number of partitions over a +period of time then the memory consumption for this may become +significant. It might be a good idea to reorder the sentences here to put the problem first and the cause later. Maybe like this: Another reason to be concerned about having a large number of partitions is that the server's memory consumption may grow significantly over a period of time, especially if many sessions touch large numbers of partitions. That's because each partition requires its own metadata that must be loaded into the local memory of each session that touches it. 5. +With data warehouse type workloads it can make sense to use a larger +number of partitions than with an OLTP type workload. Is there a comma missing between "With data warehouse type workloads" and the rest of the sentence? Thanks, Amit