Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callback
On Fri, Jul 24, 2020 at 8:07 PM Robert Haas wrote:
> On Fri, Jul 24, 2020 at 7:10 AM Bharath Rupireddy wrote:
> > I looked at what llvm_shutdown() actually does: it frees up JIT stacks,
> > and any perf-related resources if they exist, using
> > LLVMOrcDisposeInstance() and LLVMOrcUnregisterPerf(); these were
> > dynamically allocated in llvm_session_initialize through the JIT library
> > function LLVMOrcCreateInstance() [1].
> >
> > It looks like there is no problem in moving llvm_shutdown to either
> > on_shmem_exit() or on_proc_exit().
>
> If it doesn't involve shared memory, I guess it can be on_proc_exit()
> rather than on_shmem_exit().
>
> I guess the other question is why we're doing it at all. What
> resources are being allocated that wouldn't be freed up by process
> exit anyway?

LLVMOrcCreateInstance() and LLVMOrcDisposeInstance() do new and delete respectively; I found these functions at the link [1]. But I don't know for certain whether any other resources are allocated that couldn't be freed up by proc_exit(). Tagging @Andres Freund for input on whether there is any problem with making llvm_shutdown() an on_proc_exit() callback instead of a before_shmem_exit() callback.

As suggested in the previous mails, we wanted to make it on_proc_exit() to avoid the abort issue reported in this thread. If we take the abort-issue fix, commit 303640199d0436c5e7acdf50b837a027b5726594, as mentioned in the previous response [2], the change may not be strictly necessary right now, but to be safer and to avoid similar issues in the future we can consider it as well.

[1] https://llvm.org/doxygen/OrcCBindings_8cpp_source.html

LLVMOrcJITStackRef LLVMOrcCreateInstance(LLVMTargetMachineRef TM) {
  TargetMachine *TM2(unwrap(TM));
  Triple T(TM2->getTargetTriple());
  auto IndirectStubsMgrBuilder =
      orc::createLocalIndirectStubsManagerBuilder(T);
  OrcCBindingsStack *JITStack =
      new OrcCBindingsStack(*TM2, std::move(IndirectStubsMgrBuilder));
  return wrap(JITStack);
}

LLVMErrorRef LLVMOrcDisposeInstance(LLVMOrcJITStackRef JITStack) {
  auto *J = unwrap(JITStack);
  auto Err = J->shutdown();
  delete J;
  return wrap(std::move(Err));
}

[2] https://www.postgresql.org/message-id/CALj2ACVwOKZ8qYUsZrU2y2efnYZOLRxPC6k52FQcB3oriH9Kcg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
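[Editor's note: if it helps to see it concretely, the discussed change amounts to a one-line switch at the end of llvm_session_initialize(); a sketch only, though llvm_shutdown() already has the pg_on_exit_callback signature (int code, Datum arg), so no other change should be needed:]

-	before_shmem_exit(llvm_shutdown, 0);
+	on_proc_exit(llvm_shutdown, 0);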
Re: autovac issue with large number of tables
On Mon, 27 Jul 2020 at 06:43, Nasby, Jim wrote:
>
> A database with a very large number of tables eligible for autovacuum can
> result in autovacuum workers "stuck" in a tight loop of
> table_recheck_autovac() constantly reporting nothing to do on the table.
> This is because a database with a very large number of tables means it
> takes a while to search the statistics hash to verify that the table still
> needs to be processed [1]. If a worker spends some time processing a
> table, when it's done it can spend a significant amount of time rechecking
> each table that it identified at launch (I've seen a worker in this state
> for over an hour). A simple work-around in this scenario is to kill the
> worker; the launcher will quickly fire up a new worker on the same
> database, and that worker will build a new list of tables.
>
> That's not a complete solution though... if the database contains a large
> number of very small tables you can end up in a state where 1 or 2 workers
> are busy chugging through those small tables so quickly that any
> additional workers spend all their time in table_recheck_autovac(),
> because that takes long enough that the additional workers are never able
> to "leapfrog" the workers that are doing useful work.

As another solution, I've been considering adding a queue of table OIDs that need to be vacuumed/analyzed in shared memory (i.e., in DSA). Since all autovacuum workers running on the same database would see a consistent queue, the issue explained above wouldn't happen, and it would probably make it easier to implement prioritization of the tables being vacuumed, which is sometimes discussed on pgsql-hackers. I guess it might be worth discussing this idea.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
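[Editor's note: a minimal sketch of such a queue, with all names hypothetical (none of this is from an actual patch): a fixed-size ring of table OIDs in shared memory, protected by an LWLock, that every worker on the database pulls from instead of rechecking a private list:]

typedef struct AutovacWorkQueue
{
	LWLock		lock;			/* protects head/tail */
	int			head;			/* next slot to dequeue */
	int			tail;			/* next free slot */
	int			size;			/* capacity of oids[] */
	Oid			oids[FLEXIBLE_ARRAY_MEMBER];
} AutovacWorkQueue;

/* Pop the next table to process; returns InvalidOid if the queue is empty. */
static Oid
autovac_queue_pop(AutovacWorkQueue *queue)
{
	Oid			relid = InvalidOid;

	LWLockAcquire(&queue->lock, LW_EXCLUSIVE);
	if (queue->head != queue->tail)
	{
		relid = queue->oids[queue->head];
		queue->head = (queue->head + 1) % queue->size;
	}
	LWLockRelease(&queue->lock);

	return relid;
}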
Re: Reining in ExecParallelHashRepartitionFirst
On Thu, Jul 9, 2020 at 8:17 AM Melanie Plageman wrote:
> Last week as I was working on adaptive hash join [1] and trying to get
> parallel adaptive hash join batch 0 to spill correctly, I noticed what
> seemed like a problem with the code to repartition batch 0.
>
> If we run out of space while inserting tuples into the hashtable during
> the build phase of parallel hash join and proceed to increase the number
> of batches, we need to repartition all of the tuples from the old
> generation (when nbatch was x) and move them to their new homes in the
> new generation (when nbatch is 2x). Before we do this repartitioning we
> disable growth in the number of batches.
>
> Then we repartition the tuples from the hashtable, inserting them either
> back into the hashtable or into a batch file. While inserting them into
> the hashtable, we call ExecParallelHashTupleAlloc(), and, if there is no
> space for the current tuple in the current chunk and growth in the
> number of batches is disabled, we go ahead and allocate a new chunk of
> memory -- regardless of whether or not we will exceed the space limit.

Hmm. It shouldn't really be possible for ExecParallelHashRepartitionFirst() to run out of memory anyway, considering that the input of that operation previously fit (just... I mean we started repartitioning because one more chunk would have pushed us over the edge, but the tuples so far fit, and we'll insert them in the same order for each input chunk, possibly filtering some out). Perhaps you reached this condition because batches[0].shared->size ends up accounting for the memory used by the bucket array in PHJ_GROW_BUCKETS_ELECTING, but didn't originally account for it in generation 0, so what previously appeared to fit no longer does :-(. I'll look into that.
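[Editor's note: to spell out why batch 0 can only shrink during repartitioning, doubling nbatch just exposes one more hash bit, so each tuple either keeps its batch number or moves to batchno + old_nbatch. A self-contained demonstration of that property, with the batch computation simplified (the real ExecHashGetBucketAndBatch() mixes the bits differently):]

#include <assert.h>
#include <stdint.h>

/* Simplified batch assignment: high-order bits above the bucket bits. */
static int
batchno(uint32_t hashvalue, int log2_nbuckets, int nbatch)
{
	return (hashvalue >> log2_nbuckets) & (nbatch - 1);
}

int
main(void)
{
	for (uint32_t h = 0; h < 1000000; h++)
	{
		int			oldb = batchno(h, 10, 4);	/* before: nbatch = 4 */
		int			newb = batchno(h, 10, 8);	/* after:  nbatch = 8 */

		/* A tuple stays put or moves out to the new sibling batch. */
		assert(newb == oldb || newb == oldb + 4);
	}
	return 0;
}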
Re: Global snapshots
On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote:
> Hi Andrey san, Movead san,
>
> While Clock-SI seems to be considered the most promising approach to
> global serializability here,
>
> * Why does Clock-SI get so much attention? How did Clock-SI become the
> only choice?
>
> * Clock-SI was devised in Microsoft Research. Does Microsoft or some other
> organization use Clock-SI?
>
> Could you take a look at this patent? I'm afraid this is the Clock-SI for
> MVCC. Microsoft holds this until 2031. I couldn't find this with the
> keyword "Clock-SI."
>
> US8356007B2 - Distributed transaction management for database systems with
> multiversioning - Google Patents
> https://patents.google.com/patent/US8356007
>
> If it is, can we circumvent this patent?

Thank you for the research (and the previous links too). I haven't seen this patent before. It should be studied carefully.

--
regards,
Andrey Lepikhov
Postgres Professional
RE: Global snapshots
Hi Andrey san, Movead san,

From: tsunakawa.ta...@fujitsu.com
> While Clock-SI seems to be considered the most promising approach to
> global serializability here,
>
> * Why does Clock-SI get so much attention? How did Clock-SI become the
> only choice?
>
> * Clock-SI was devised in Microsoft Research. Does Microsoft or some other
> organization use Clock-SI?

Could you take a look at this patent? I'm afraid this is the Clock-SI for MVCC. Microsoft holds this until 2031. I couldn't find this with the keyword "Clock-SI."

US8356007B2 - Distributed transaction management for database systems with multiversioning - Google Patents
https://patents.google.com/patent/US8356007

If it is, can we circumvent this patent?

Regards
Takayuki Tsunakawa
Re: proposal: possibility to read dumped table's name from file
ne 26. 7. 2020 v 21:10 odesílatel Justin Pryzby napsal:
> On Sat, Jul 25, 2020 at 06:56:31PM +0530, vignesh C wrote:
> > On Tue, Jul 14, 2020 at 12:03 PM Pavel Stehule wrote:
> > >> I meant can this:
> > >> printf(_("  --filter=FILENAME            read object name filter
> > >>          expressions from file\n"));
> > >> be changed to:
> > >> printf(_("  --filter=FILENAME            dump objects and data based
> > >>          on the filter expressions from the filter file\n"));
> > >
> > > done in today's patch
> >
> > Thanks for fixing the comments.
> > Few comments:
> > +   /* use "-" as symbol for stdin */
> > +   if (strcmp(filename, "-") != 0)
> > +   {
> > +       fp = fopen(filename, "r");
> > +       if (!fp)
> > +           fatal("could not open the input file \"%s\": %m",
> > +                 filename);
> > +   }
> > +   else
> > +       fp = stdin;
> >
> > We could use STDIN itself instead of -, it will be an easier option to
> > understand.
>
> I think "-" is used widely for commandline tools, and STDIN is not (even
> though it's commonly used by programmers). For example, since last year,
> pg_restore -f - means stdout.

Yes, STDIN is used by programming languages, but it is not usual in command-line tools. And because "-" was already adopted by pg_restore, we should not introduce a new inconsistency.

Regards

Pavel
Re: proposal: possibility to read dumped table's name from file
so 25. 7. 2020 v 15:26 odesílatel vignesh C napsal:
> On Tue, Jul 14, 2020 at 12:03 PM Pavel Stehule wrote:
> >> I meant can this:
> >> printf(_("  --filter=FILENAME            read object name filter
> >>          expressions from file\n"));
> >> be changed to:
> >> printf(_("  --filter=FILENAME            dump objects and data based
> >>          on the filter expressions from the filter file\n"));
> >
> > done in today's patch
>
> Thanks for fixing the comments.
> Few comments:
> +   /* use "-" as symbol for stdin */
> +   if (strcmp(filename, "-") != 0)
> +   {
> +       fp = fopen(filename, "r");
> +       if (!fp)
> +           fatal("could not open the input file \"%s\": %m",
> +                 filename);
> +   }
> +   else
> +       fp = stdin;
>
> We could use STDIN itself instead of -, it will be an easier option to
> understand.
>
> +   /* when first char is hash, ignore whole line */
> +   if (*line == '#')
> +       continue;
>
> If a line starts with # we ignore that line; I feel this should be
> included in the documentation.

Good note -- I added a sentence to the docs:

+
+        Lines starting with the symbol # are ignored.
+        Leading whitespace (spaces, tabs) is not allowed.  These
+        lines can be used for comments and notes.
+

> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7a37fd8045..0e0fbc90db 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -755,6 +755,56 @@ PostgreSQL documentation
 
+      --filter=filename
+
+        Read object filters from the specified file.
+        If you use "-" as the filename, the filters are read from stdin.
+        The lines of this file must have the following format:
+
+        (+|-)[tnfd] objectname
+
+        The first character specifies whether the object is to be included
+        (+) or excluded (-), and the second character specifies the type of
+        object to be filtered:
+        t (table),
+        n (schema),
+        f (foreign table),
+        d (table data).
+
+        With the following filter file, the dump would include table
+        mytable1 and data from foreign table some_foreign_table, but
+        exclude data from table mytable2.
+
+        +t mytable1
+        +f some_foreign_table
+        -d mytable2
+
+        Lines starting with the symbol # are ignored.
+        Leading whitespace (spaces, tabs) is not allowed.  These
+        lines can be used for comments and notes.
+
+        The --filter option works just like the other
+        options to include or exclude tables, schemas, table data, or foreign
+        tables, and both forms may be combined.  Note that there are no
+        options to exclude a specific foreign table or to include a specific
+        table's data.
+
+
+
+
       --if-exists

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 94459b3539..8df1f91cb6 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -290,6 +290,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AHX);
 static TableInfo *getRootTableInfo(TableInfo *tbinfo);
+static void read_patterns_from_file(char *filename, DumpOptions *dopt);
 
 int
@@ -364,6 +365,7 @@ main(int argc, char **argv)
 		{"enable-row-security", no_argument, &dopt.enable_row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
 		{"extra-float-digits", required_argument, NULL, 8},
+		{"filter", required_argument, NULL, 12},
 		{"if-exists", no_argument, &dopt.if_exists, 1},
 		{"inserts", no_argument, NULL, 9},
 		{"lock-wait-timeout", required_argument, NULL, 2},
@@ -603,6 +605,10 @@ main(int argc, char **argv)
 						  optarg);
 				break;
 
+			case 12:			/* filter implementation */
+				read_patterns_from_file(optarg, &dopt);
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit_nicely(1);
@@ -1022,6 +1028,8 @@ help(const char *progname)
 			 "                               access to)\n"));
 	printf(_("  --exclude-table-data=PATTERN do NOT dump data for the specified table(s)\n"));
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
+	printf(_("  --filter=FILENAME            dump objects and data based on the filter expressions\n"
+			 "                               from the filter file\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -18434,3 +18442,211 @@ appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 	if (!res)
 		pg_log_warning("could not parse reloptions array")
Re: Parallel worker hangs while handling errors.
On Sat, Jul 25, 2020 at 7:02 AM vignesh C wrote:
>
> I have made slight changes on top of the patch to remove duplicate
> code, attached v3 patch for the same.
> The parallel worker hang issue gets resolved, make check & make
> check-world passes.

Having a function to unblock selective signals for a bg worker looks good to me. A few comments:

1. Do we need "worker" as a function argument in update_parallel_worker_sigmask(BackgroundWorker *worker, ...)? Since MyBgworkerEntry is a global variable, can't we have a local variable instead?

2. Instead of update_parallel_worker_sigmask() serving only parallel workers, can we make it generic, so that for any bgworker, given a signal, it unblocks it? Although there's no current use case for a bg worker unblocking a single signal other than a parallel worker doing it for SIGUSR1 for this hang issue. Please note that we already have BackgroundWorkerBlockSignals() and BackgroundWorkerUnblockSignals(). I slightly modified your function, something like below:

void
BackgroundWorkerUpdateSignalMask(int signum, bool toblock)
{
    if (toblock)
        sigaddset(&BlockSig, signum);
    else
        sigdelset(&BlockSig, signum);

    PG_SETMASK(&BlockSig);
}

/* to unblock SIGUSR1 */
if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
    BackgroundWorkerUpdateSignalMask(SIGUSR1, false);

/* to block SIGUSR1 */
if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
    BackgroundWorkerUpdateSignalMask(SIGUSR1, true);

If we go with the BackgroundWorkerUpdateSignalMask() function, please note that we might have to document it in bgworker.sgml as well.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: recovering from "found xmin ... from before relfrozenxid ..."
Hi,

Thanks for sharing your thoughts. Please find my comments inline below:

> I think here we should report that we haven't done what was asked.
> +       /* Nothing to do if the itemid is unused or already dead. */
> +       if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
> +           continue;

Okay. Will add a log message saying "skipping tid ... because ...".

> Also, should we try to fix the VM along the way?

I think we should let VACUUM do that.

> Are there any caveats with concurrent VACUUM? (I do not see any, just
> asking)

As of now, I don't see any.

> It would be good to have some checks for interrupts in safe places.

I think I have already added those wherever I felt they were required. If you feel one is missing somewhere, it would be good if you could point it out.

> I think we should not trust the user entirely here. I'd prefer validation
> and a graceful exit, not a core dump.
> +       Assert(noffs <= PageGetMaxOffsetNumber(page));

Yeah, sounds reasonable. Will do that.

> Also, I'd be happy if we had something like "Restore this tuple iff this
> does not break unique constraint". To do so we need to sort tids by
> xmin/xmax, to revive the most recent data first.

I didn't get this point. Could you please elaborate?

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: new heapcheck contrib module
On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger wrote:
> []
> >
> > +       StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
> > +                        "InvalidOffsetNumber increments to FirstOffsetNumber");
> >
> > If you are going to rely on this property, I agree that it is good to
> > check it. But it would be better to NOT rely on this property, and I
> > suspect the code can be written quite cleanly without relying on it.
> > And actually, that's what you did, because you first set ctx.offnum =
> > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
> > the loop initializer. So AFAICS the first initializer, and the static
> > assert, are pointless.
>
> Ah, right you are. Removed.

I can see the same assert and the unnecessary assignment in v12-0002; is that the same thing that was supposed to be removed, or am I missing something?

> []
> > +confess(HeapCheckContext * ctx, char *msg)
> > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
> > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> >
> > This is what happens when you pgindent without adding all the right
> > things to typedefs.list first ... or when you don't pgindent and have
> > odd ideas about how to indent things.
>
> Hmm. I don't see the three lines of code you are quoting. Which patch is
> that from?

I think it was the same thing as my previous suggestion to list the new structures in typedefs.list. V12 has listed the new structures, but I think there are still some adjustments needed in the code, e.g. see the space between HeapCheckContext and the * (asterisk), which needs to be fixed. I am not sure if pgindent will do that or not.

Here are a few more minor comments on the v12-0002 patch; some of them apply to the other patches as well:

 #include "utils/snapmgr.h"
-
+#include "amcheck.h"

Doesn't seem to be in the correct place -- it needs to be in sorted order.

+	if (!PG_ARGISNULL(3))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("starting block " INT64_FORMAT
+						" is out of bounds for relation with no blocks",
+						PG_GETARG_INT64(3))));
+	if (!PG_ARGISNULL(4))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("ending block " INT64_FORMAT
+						" is out of bounds for relation with no blocks",
+						PG_GETARG_INT64(4))));

I think these errmsg() strings should also be on one line.

+	if (fatal)
+	{
+		if (ctx.toast_indexes)
+			toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes,
+								ShareUpdateExclusiveLock);
+		if (ctx.toastrel)
+			table_close(ctx.toastrel, ShareUpdateExclusiveLock);

The toast index and rel closing block style is not the same as at the end of verify_heapam().

+	/* If we get this far, we know the relation has at least one block */
+	startblock = PG_ARGISNULL(3) ? 0 : PG_GETARG_INT64(3);
+	endblock = PG_ARGISNULL(4) ? ((int64) ctx.nblocks) - 1 : PG_GETARG_INT64(4);
+	if (startblock < 0 || endblock >= ctx.nblocks || startblock > endblock)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("block range " INT64_FORMAT " .. " INT64_FORMAT
+						" is out of bounds for relation with block count %u",
+						startblock, endblock, ctx.nblocks)));
...
+	if (startblock < 0)
+		startblock = 0;
+	if (endblock < 0 || endblock > ctx.nblocks)
+		endblock = ctx.nblocks;

Other than the endblock < 0 case, do we really need that? I think due to the above error check the rest of the cases will not reach this place.
+			confess(ctx, psprintf(
+								  "tuple xmax %u follows last assigned xid %u",
+								  xmax, ctx->nextKnownValidXid));
+			fatal = true;
+		}
+	}
+
+	/* Check for tuple header corruption */
+	if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader)
+	{
+		confess(ctx,
+				psprintf("tuple's header size is %u bytes which is less than the %u byte minimum valid header size",
+						 ctx->tuphdr->t_hoff,
+						 (unsigned) SizeofHeapTupleHeader));

The confess() calls have two different code styles: one where psprintf()'s first argument gets its own line, and a second where psprintf() itself starts a new line with its argument following. I think the second style is what we follow and is correct, not the former.

+	if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("\"%s\" is not a heap",
+						RelationGetRelationName(rel))));

Like elsewhere, can we have the errmsg as "only heap AM is supported" and the error code as ERRCODE_FEATURE_NOT_SUPPORTED?

That's all for now; apologies for the multiple review emails.

Regards,
Amul
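[Editor's note: for the first point in this review, a sketch of how the scan loop can avoid both the redundant initializer and the static assertion (variable names taken from the patch; the surrounding context is assumed):]

OffsetNumber maxoff = PageGetMaxOffsetNumber(ctx.page);

for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
	 ctx.offnum = OffsetNumberNext(ctx.offnum))
{
	ItemId		itemid = PageGetItemId(ctx.page, ctx.offnum);

	/* per-line-pointer checks go here */
}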
Re: Loaded footgun open_datasync on Windows
On Thu, Jul 23, 2020 at 01:05:04PM -0400, Jeff Janes wrote:
> I have noticed this before, but since it wasn't a production machine I just
> shrugged it off as being a hazard of using consumer-grade stuff; it didn't
> seem to be worth investigating further.

The most direct and non-invasive way to address this problem in the back-branches would be to copy the non-concurrent logic in src/port/open.c directly into pg_test_fsync.c. We have done such things in the past, for example with restricted tokens on Windows for pg_upgrade (see fa1e5afa), but that was much more important. I am not sure that it is worth the time spent here though, as it took roughly 10 years to notice the problem (and I don't really count the time when the tool was under src/tools/, as that did not build the tool on Windows).

--
Michael
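[Editor's note: for illustration, the non-invasive back-branch fix might look roughly like this at the top of pg_test_fsync.c; a sketch only -- pgwin32_open() is what src/port/open.c provides, but whether a plain macro is sufficient here is an assumption:]

#ifdef WIN32
extern int	pgwin32_open(const char *fileName, int fileFlags,...);

/* Route open() through the backend's wrapper so that O_DSYNC maps to
 * FILE_FLAG_WRITE_THROUGH, as it does in the server proper. */
#define open(f, flags, mode)	pgwin32_open((f), (flags), (mode))
#endif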
Re: Allow Extended Protocol to support CURSOR_OPT_HOLD with prepared stmt.
>
> 2. Currently I want to add a new GUC parameter; if set to true, the server
> will create a holdable portal, or else nothing is changed. Then let the
> user set it to true in the above case and reset it to false afterward. Is
> there any issue with this method?

I forgot to say that in this case, the user has to drop the holdable portal explicitly.

--
Best Regards
Andy Fan
Allow Extended Protocol to support CURSOR_OPT_HOLD with prepared stmt.
I have a use case like this:

    rs = prepared_stmt.execute(1);
    while (rs.next())
    {
        // do something with the result and commit the transaction.
        conn.commit();
    }

The driver uses the extended protocol in this case. It works like this:

1). Parse -> PreparedStmt.
2). Bind -> Bind the prepared stmt with a Portal; there is no chance to set the CURSOR_OPT_HOLD option.
3). Execute.
4). Commit -- the portal is dropped at this stage.
5). When fetching the next batch of results, we get the error "Portal doesn't exist".

There are several ways we can work around this, but none is perfect:

1. Run the prepared stmt in a dedicated connection. (The number of connections would double.)
2. Use a WITH HOLD cursor. It doesn't support bind parameters, so we would have to create a cursor for each dedicated id.
3. Don't commit the transaction -- a long transaction with many rows locked.

I have several questions about this case:

1. How about adding cursorOptions information to the Bind protocol message? Then we could set portal->cursorOptions accordingly. If so, how do we stay compatible with old drivers?

2. Currently I want to add a new GUC parameter; if set to true, the server will create a holdable portal, or else nothing is changed. Then let the user set it to true in the above case and reset it to false afterward. Is there any issue with this method?

--
Best Regards
Andy Fan
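[Editor's note: a minimal sketch of the second idea, with all names assumed (there is no actual patch behind this): a boolean GUC consulted when the Bind message creates the portal:]

/* Hypothetical: in exec_bind_message(), once the portal has been created.
 * "hold_portals" is an assumed bool GUC defaulting to false. */
if (hold_portals)
	portal->cursorOptions |= CURSOR_OPT_HOLD;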
Re: INSERT INTO SELECT, Why Parallelism is not selected?
On Sun, Jul 26, 2020 at 4:54 PM Amit Kapila wrote:
>
> On Sat, Jul 25, 2020 at 8:42 PM Tom Lane wrote:
> >
> No, "git diff --check" doesn't help. I have tried pgindent but that
> also doesn't help, nor was I expecting it to. I am still not able to
> figure out how I goofed this up but will spend some more time on it.

I think I have figured out how the patch ended up having a non-breaking space. Some editors add one when we use two spaces after a period (.), and I can see that my Gmail client does that for such a pattern. Normally, I use one of MSVC, vi, or NetBeans IDE to develop code/patches, but sometimes I check comments by pasting them into the Gmail client to find typos and the like, and then fix them manually. I guess in this case I used the Gmail client to write this comment and then copy/pasted it into the patch. I will be careful not to do this in the future.

--
With Regards,
Amit Kapila.
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.
On Mon, Jul 27, 2020 at 10:48:45AM +1200, David Rowley wrote:
> On Wed, 1 Jul 2020 at 18:46, Jeff Davis wrote:
> >
> > On Tue, Jun 30, 2020, 7:04 PM David Rowley wrote:
> >>
> >> Does anyone have any objections to that being changed?
> >
> > That's OK with me. By the way, I'm on vacation and will catch up on these
> > HashAgg threads next week.
>
> (Adding Justin as I know he's expressed interest in the EXPLAIN output
> of HashAgg before)

Thanks. It's unrelated to hashAgg vs hashJoin, but I also noticed that this is shown only conditionally:

	if (es->format != EXPLAIN_FORMAT_TEXT)
	{
		if (es->costs && aggstate->hash_planned_partitions > 0)
		{
			ExplainPropertyInteger("Planned Partitions", NULL,
								   aggstate->hash_planned_partitions, es);

That has been conditional since it was introduced at 1f39bce02:

	if (es->costs && aggstate->hash_planned_partitions > 0)
	{
		ExplainPropertyInteger("Planned Partitions", NULL,
							   aggstate->hash_planned_partitions, es);
	}

I think 40efbf870 should've handled this, too.

--
Justin
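[Editor's note: if I read this right, the fix would be to hoist the property out of the non-text-only branch, something like the following untested sketch:]

/* Emit Planned Partitions in every format, including text. */
if (es->costs && aggstate->hash_planned_partitions > 0)
	ExplainPropertyInteger("Planned Partitions", NULL,
						   aggstate->hash_planned_partitions, es);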
Re: Fast DSM segments
On Sat, Jun 20, 2020 at 7:17 AM Andres Freund wrote:
> On 2020-06-19 17:42:41 +1200, Thomas Munro wrote:
> > On Thu, Jun 18, 2020 at 6:05 PM Thomas Munro wrote:
> > > Here's a version that adds some documentation.
> >
> > I jumped on a dual socket machine with 36 cores/72 threads and 144GB
> > of RAM (Azure F72s_v2) running Linux, configured with 50GB of huge
> > pages available, and I ran a very simple test: select count(*) from t
> > t1 join t t2 using (i), where the table was created with create table
> > t as select generate_series(1, 4)::int i, and then prewarmed
> > into 20GB of shared_buffers.
>
> I assume all the data fits into 20GB?

Yep.

> Which kernel version is this?

Tested on 4.19 (Debian stable/10).

> How much of the benefit comes from huge pages being used, how much from
> avoiding the dsm overhead, and how much from the page table being shared
> for that mapping? Do you have a rough idea?

Without huge pages, the 36-process version of the test mentioned above shows around a 1.1x speedup, which is in line with the numbers from my first message (which was from a much smaller computer). The rest of the speedup (2x) is due to huge pages. Further speedups are available by increasing the hash chunk size, and probably by doing NUMA-aware allocation, in later work.

Here's a new version, using the name min_dynamic_shared_memory, which sounds better to me. Any objections? I also fixed the GUC's maximum setting so that it's sure to fit in size_t.

From 3a77e8ab67e5492340425f1ce4a36a45a8c80d05 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 27 Jul 2020 12:07:05 +1200
Subject: [PATCH v3] Preallocate some DSM space at startup.

Create an optional region in the main shared memory segment that can be
used to acquire and release "fast" DSM segments, and can benefit from
huge pages allocated at cluster startup time, if configured.  Fall back
to the existing mechanisms when that space is full.  The size is
controlled by a new GUC min_dynamic_shared_memory, defaulting to 0.

Main region DSM segments initially contain whatever garbage the memory
held last time they were used, rather than zeroes.  That change revealed
that DSA areas failed to initialize themselves correctly in memory that
wasn't zeroed first, so fix that problem.

Discussion: https://postgr.es/m/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  26 +++
 src/backend/storage/ipc/dsm.c                 | 184 --
 src/backend/storage/ipc/dsm_impl.c            |   3 +
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  11 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/backend/utils/mmgr/dsa.c                  |   5 +-
 src/include/storage/dsm.h                     |   3 +
 src/include/storage/dsm_impl.h                |   1 +
 9 files changed, 212 insertions(+), 25 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6ce5907896..48fef11041 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1864,6 +1864,32 @@ include_dir 'conf.d'
 
+      min_dynamic_shared_memory (integer)
+
+       min_dynamic_shared_memory configuration parameter
+
+
+        Specifies the amount of memory that should be allocated at server
+        startup time for use by parallel queries.  When this memory region is
+        insufficient or exhausted by concurrent parallel queries, new
+        parallel queries try to allocate extra shared memory temporarily from
+        the operating system using the method configured with
+        dynamic_shared_memory_type, which may be slower
+        due to memory management overheads.
+        Memory that is allocated at startup time with
+        min_dynamic_shared_memory is affected by the
+        huge_pages setting on operating systems where that
+        is supported, and may be more likely to benefit from larger pages on
+        operating systems where page size is managed automatically.  Larger
+        memory pages can improve the performance of parallel hash joins.
+        The default value is 0 (none).
+
+
+

diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index ef64d08357..b9941496a3 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -35,10 +35,12 @@
 
 #include "lib/ilist.h"
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 #include "storage/dsm.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "utils/freepage.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/resowner_private.h"
@@ -76,6 +78,8 @@ typedef struct dsm_control_item
 {
 	dsm_handle	handle;
 	uint32		refcnt;			/* 2+ = active, 1 = moribund, 0 = gone */
+	size_t		first_page;
+	size_t		npages;
 	void	   *impl_pr
Re: [PATCH] Performance Improvement For Copy From Binary Files
On Sun, Jul 26, 2020 at 6:06 AM Tom Lane wrote:
> Amit Langote writes:
> > [ v7-0001-Improve-performance-of-binary-COPY-FROM-with-buff.patch ]
>
> Pushed with cosmetic changes.

Thanks for that.

> I'd always supposed that stdio does enough internal buffering that short
> fread()s shouldn't be much worse than memcpy().  But I reproduced your
> result of ~30% speedup for data with a lot of narrow text columns, using
> RHEL 8.2.  Thinking this an indictment of glibc, I also tried it on
> current macOS, and saw an even bigger speedup, approaching 50%.  So
> there's definitely something to this.  I wonder if we ought to check
> other I/O-constrained users of fread and fwrite, like pg_dump/pg_restore.

Ah, maybe a good idea to check that.

> A point that I did not see addressed in the thread is whether this
> has any negative impact on the copy-from-frontend code path, where
> there's no fread() to avoid; short reads from CopyGetData() are
> already going to be satisfied by memcpy'ing from the fe_msgbuf.
> However, a quick check suggests that this patch is still a small
> win for that case too --- apparently the control overhead in
> CopyGetData() is not negligible.

Indeed.

> So the patch seems fine functionally, but there were some cosmetic
> things I didn't like:
>
> * Removing CopyGetInt32 and CopyGetInt16 seemed like a pretty bad
> idea, because it made the callers much uglier and more error-prone.
> This is a particularly bad example:
>
>     /* Header extension length */
> -   if (!CopyGetInt32(cstate, &tmp) ||
> -       tmp < 0)
> +   if (CopyReadBinaryData(cstate, (char *) &tmp, sizeof(tmp)) !=
> +       sizeof(tmp) || (tmp = (int32) pg_ntoh32(tmp)) < 0)
>
> Putting side-effects into late stages of an if-condition is just
> awful coding practice.  They're easy for a reader to miss and they
> are magnets for bugs, because of the possibility that control doesn't
> reach that part of the condition.
>
> You can get the exact same speedup without any of those disadvantages
> by marking these two functions "inline", so that's what I did.
>
> * I dropped the DRAIN_COPY_RAW_BUF macro too, as in my estimation it was
> a net negative for readability.  With only two use-cases, having it made
> the code longer not shorter; I was also pretty unconvinced about the
> wisdom of having some of the loop's control logic inside the macro and
> some outside.
>
> * BTW, the macro definitions weren't particularly per project style
> anyway.  We generally put at least one space before line-ending
> backslashes.  I don't think pgindent will fix this for you; IME
> it doesn't touch macro definitions at all.
>
> * Did some more work on the comments.

Thanks for these changes.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
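[Editor's note: for reference, the inline helper Tom describes has roughly this shape (paraphrased from memory of the committed code, so treat details as approximate):]

/* Read a 32-bit integer in network byte order from the binary COPY input;
 * returns false, with *val zeroed, on a short read. */
static inline bool
CopyGetInt32(CopyState cstate, int32 *val)
{
	uint32		buf;

	if (CopyReadBinaryData(cstate, (char *) &buf, sizeof(buf)) != sizeof(buf))
	{
		*val = 0;				/* suppress compiler warning */
		return false;
	}
	*val = (int32) pg_ntoh32(buf);
	return true;
}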
Re: handle an ECPG_bytea typo
On Sat, Jul 25, 2020 at 06:17:42PM +0900, Michael Paquier wrote:
> ECPGset_noind_null() and ECPGis_noind_null() in misc.c show that
> ECPGgeneric_bytea is attached to ECPGt_bytea. The two structures may
> be the same now, but if a bug fix or a code change involves a change
> in the structure definition we could run into problems. So let's fix
> and back-patch this change. I am not spotting other areas impacted,
> and I'll try to take care of it at the beginning of next week.

Okay, fixed as e971357. The issue came from 050710b, so this fix was only needed in 12~.

--
Michael
Re: Parallel bitmap index scan
On Mon, 27 Jul 2020 at 3:48 AM, Thomas Munro wrote:
> On Mon, Jul 27, 2020 at 1:58 AM Dilip Kumar wrote:
> > On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar wrote:
> > >
> > > I would like to propose a patch for enabling the parallelism for the
> > > bitmap index scan path.
>
>    Workers Planned: 4
>    ->  Parallel Bitmap Heap Scan on tenk1
>          Recheck Cond: (hundred > 1)
> -        ->  Bitmap Index Scan on tenk1_hundred
> +        ->  Parallel Bitmap Index Scan on tenk1_hundred
>                Index Cond: (hundred > 1)
>
> +1, this is a very good feature to have.
>
> +                       /* Merge bitmap to a common shared bitmap */
> +                       SpinLockAcquire(&pstate->mutex);
> +                       tbm_merge(tbm, &pstate->tbm_shared, &pstate->pt_shared);
> +                       SpinLockRelease(&pstate->mutex);
>
> This doesn't look like a job for a spinlock.

Yes, I agree with that.

> You have a barrier so that you can wait until all workers have
> finished merging their partial bitmap into the complete bitmap, which
> makes perfect sense.  You also use that spinlock (probably should be
> an LWLock) to serialise the bitmap merging work...  Hmm, I suppose
> there would be an alternative design which also uses the barrier to
> serialise the merge, and has the merging done entirely by one process,
> like this:
>
> bool chosen_to_merge = false;
>
> /* Attach to the barrier, and see what phase we're up to. */
> switch (BarrierAttach())
> {
> case PBH_BUILDING:
>     ... build my partial bitmap in shmem ...
>     chosen_to_merge = BarrierArriveAndWait();
>     /* Fall through */
>
> case PBH_MERGING:
>     if (chosen_to_merge)
>         ... perform merge of all partial results into final shmem bitmap ...
>     BarrierArriveAndWait();
>     /* Fall through */
>
> case PBH_SCANNING:
>     /* We attached too late to help build the bitmap. */
>     BarrierDetach();
>     break;
> }
>
> Just an idea, not sure if it's a good one.  I find it a little easier
> to reason about the behaviour of late-attaching workers when the
> phases are explicitly named and handled with code like that (it's not
> immediately clear to me whether your coding handles late attachers
> correctly, which seems to be one of the main problems to think about
> with "dynamic party" parallelism...).

Yeah, this seems like a better idea. I am handling the late-attachers case, but the idea of using the barrier phase looks quite clean. I will change it this way.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: hashagg slowdown due to spill changes
On Sat, 2020-07-25 at 15:08 -0700, Peter Geoghegan wrote:
> I find that Andres' original "SELECT cat, count(*) FROM
> fewgroups_many_rows GROUP BY 1;" test case is noticeably improved by
> the patch. Without the patch, v13 takes ~11.46 seconds. With the
> patch, it takes only ~10.64 seconds.

I saw less of an improvement than you or Andres (perhaps just more noise). But given that both you and Andres are reporting a measurable improvement, I went ahead and committed it.

Thank you.

Regards,
	Jeff Davis
Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.
On Wed, 1 Jul 2020 at 18:46, Jeff Davis wrote:
>
> On Tue, Jun 30, 2020, 7:04 PM David Rowley wrote:
>>
>> Does anyone have any objections to that being changed?
>
> That's OK with me. By the way, I'm on vacation and will catch up on these
> HashAgg threads next week.

(Adding Justin as I know he's expressed interest in the EXPLAIN output of HashAgg before)

I've written a patch to bring the HashAgg EXPLAIN ANALYZE output more into line with the Hash Join output.

Couple of things I observed about Hash Join EXPLAIN ANALYZE:

1. The number of batches starts at 1.
2. We always display the number of batches.
3. We write "Batches" for text format and "Hash Batches" for non-text formats.
4. We write "Memory Usage" for text format and "Peak Memory Usage" for non-text formats.
5. "Batches" comes before memory usage.

Before this patch, HashAgg EXPLAIN ANALYZE output would:

1. Start the number of batches at 0.
2. Only display "Hash Batches" when batches > 0.
3. Use the words "HashAgg Batches" for both text and non-text formats.
4. Use the words "Peak Memory Usage" for both text and non-text formats.
5. Write "Hash Batches" after memory usage.

In the attached patch I've changed HashAgg to be aligned with Hash Join on each of the points above, e.g.

Before:

postgres=# explain analyze select c.relname,count(*) from pg_class c inner join pg_attribute a on c.oid = a.attrelid group by c.relname;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=138.37..142.23 rows=386 width=72) (actual time=3.121..3.201 rows=427 loops=1)
   Group Key: c.relname
   Peak Memory Usage: 109kB
   ->  Hash Join  (cost=21.68..124.10 rows=2855 width=64) (actual time=0.298..1.768 rows=3153 loops=1)
         Hash Cond: (a.attrelid = c.oid)
         ->  Seq Scan on pg_attribute a  (cost=0.00..93.95 rows=3195 width=4) (actual time=0.011..0.353 rows=3153 loops=1)
         ->  Hash  (cost=16.86..16.86 rows=386 width=68) (actual time=0.279..0.279 rows=427 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 50kB
               ->  Seq Scan on pg_class c  (cost=0.00..16.86 rows=386 width=68) (actual time=0.007..0.112 rows=427 loops=1)
 Planning Time: 0.421 ms
 Execution Time: 3.294 ms
(11 rows)

After:

postgres=# explain analyze select c.relname,count(*) from pg_class c inner join pg_attribute a on c.oid = a.attrelid group by c.relname;
                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=566.03..580.00 rows=1397 width=72) (actual time=13.097..13.430 rows=1397 loops=1)
   Group Key: c.relname
   Batches: 1  Memory Usage: 321kB
   ->  Hash Join  (cost=64.43..496.10 rows=13985 width=64) (actual time=0.838..7.546 rows=13985 loops=1)
         Hash Cond: (a.attrelid = c.oid)
         ->  Seq Scan on pg_attribute a  (cost=0.00..394.85 rows=13985 width=4) (actual time=0.010..1.462 rows=13985 loops=1)
         ->  Hash  (cost=46.97..46.97 rows=1397 width=68) (actual time=0.820..0.821 rows=1397 loops=1)
               Buckets: 2048  Batches: 1  Memory Usage: 153kB
               ->  Seq Scan on pg_class c  (cost=0.00..46.97 rows=1397 width=68) (actual time=0.009..0.362 rows=1397 loops=1)
 Planning Time: 0.440 ms
 Execution Time: 13.634 ms
(11 rows)

(Ignore the change in memory consumption; that was due to adding records for testing.)

Any objections to this change?

David

Attachment: yet_more_hashagg_explain_fixes.patch
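[Editor's note: the text-format side of such a change would be produced along these lines; a simplified sketch of the EXPLAIN code, with locals such as memPeakKb assumed from context rather than copied from the patch:]

if (es->format == EXPLAIN_FORMAT_TEXT)
{
	ExplainIndentText(es);
	appendStringInfo(es->str, "Batches: %d  Memory Usage: " INT64_FORMAT "kB\n",
					 aggstate->hash_batches_used, memPeakKb);
}
else
{
	ExplainPropertyInteger("Hash Batches", NULL,
						   aggstate->hash_batches_used, es);
	ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
}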
Re: Parallel bitmap index scan
On Mon, Jul 27, 2020 at 1:58 AM Dilip Kumar wrote:
> On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar wrote:
> >
> > I would like to propose a patch for enabling the parallelism for the
> > bitmap index scan path.

   Workers Planned: 4
   ->  Parallel Bitmap Heap Scan on tenk1
         Recheck Cond: (hundred > 1)
-        ->  Bitmap Index Scan on tenk1_hundred
+        ->  Parallel Bitmap Index Scan on tenk1_hundred
           Index Cond: (hundred > 1)

+1, this is a very good feature to have.

+                       /* Merge bitmap to a common shared bitmap */
+                       SpinLockAcquire(&pstate->mutex);
+                       tbm_merge(tbm, &pstate->tbm_shared, &pstate->pt_shared);
+                       SpinLockRelease(&pstate->mutex);

This doesn't look like a job for a spinlock. You have a barrier so that you can wait until all workers have finished merging their partial bitmap into the complete bitmap, which makes perfect sense. You also use that spinlock (probably should be an LWLock) to serialise the bitmap merging work... Hmm, I suppose there would be an alternative design which also uses the barrier to serialise the merge, and has the merging done entirely by one process, like this:

bool chosen_to_merge = false;

/* Attach to the barrier, and see what phase we're up to. */
switch (BarrierAttach())
{
case PBH_BUILDING:
    ... build my partial bitmap in shmem ...
    chosen_to_merge = BarrierArriveAndWait();
    /* Fall through */

case PBH_MERGING:
    if (chosen_to_merge)
        ... perform merge of all partial results into final shmem bitmap ...
    BarrierArriveAndWait();
    /* Fall through */

case PBH_SCANNING:
    /* We attached too late to help build the bitmap. */
    BarrierDetach();
    break;
}

Just an idea, not sure if it's a good one. I find it a little easier to reason about the behaviour of late-attaching workers when the phases are explicitly named and handled with code like that (it's not immediately clear to me whether your coding handles late attachers correctly, which seems to be one of the main problems to think about with "dynamic party" parallelism...).
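[Editor's note: for the narrower fix, if the merge stays concurrent, replacing the spinlock might look like this; the merge_lock field is an assumption, not something in the patch:]

/* tbm_merge() can do a non-trivial amount of work (page table unions,
 * possibly allocation), well beyond what spinlock guidelines allow. */
LWLockAcquire(&pstate->merge_lock, LW_EXCLUSIVE);
tbm_merge(tbm, &pstate->tbm_shared, &pstate->pt_shared);
LWLockRelease(&pstate->merge_lock);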
Re: proposal: possibility to read dumped table's name from file
On Sat, Jul 25, 2020 at 06:56:31PM +0530, vignesh C wrote:
> On Tue, Jul 14, 2020 at 12:03 PM Pavel Stehule wrote:
> >> I meant can this:
> >> printf(_("  --filter=FILENAME            read object name filter
> >>          expressions from file\n"));
> >> be changed to:
> >> printf(_("  --filter=FILENAME            dump objects and data based
> >>          on the filter expressions from the filter file\n"));
> >
> > done in today's patch
>
> Thanks for fixing the comments.
> Few comments:
> +   /* use "-" as symbol for stdin */
> +   if (strcmp(filename, "-") != 0)
> +   {
> +       fp = fopen(filename, "r");
> +       if (!fp)
> +           fatal("could not open the input file \"%s\": %m",
> +                 filename);
> +   }
> +   else
> +       fp = stdin;
>
> We could use STDIN itself instead of -, it will be an easier option to
> understand.

I think "-" is used widely for commandline tools, and STDIN is not (even though it's commonly used by programmers). For example, since last year, pg_restore -f - means stdout.

--
Justin
Re: expose parallel leader in CSV and log_line_prefix
On Sun, Jul 26, 2020 at 04:42:06PM +0900, Michael Paquier wrote:
> On Thu, Jul 23, 2020 at 09:52:14AM +0900, Michael Paquier wrote:
> > Sounds fine to me. Thanks.
> >
> > Do others have any objections with this wording?
>
> I have used the wording suggested by Alvaro, and applied the patch
> down to 13. Now let's see about the original item of this thread..

Updated with new wording to avoid "null", per Tom.

--
Justin

From 665c9ea8827d4ad94ea0100c5ba93dbf05db5943 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Fri, 13 Mar 2020 22:03:06 -0500
Subject: [PATCH v3] Include the leader PID in logfile

See also: b025f32e0b, which adds the leader PID to pg_stat_activity
---
 doc/src/sgml/config.sgml                      | 10 ++-
 src/backend/utils/error/elog.c                | 30 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e806b13754..c525a0fd15 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6687,6 +6687,13 @@ local0.*/var/log/postgresql
          Process ID
          no
 
+         %P
+         For a parallel worker, this is the Process ID of its leader
+         process.
+
+         no
+
          %t
          Time stamp without milliseconds
@@ -7019,7 +7026,7 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
         character count of the error position therein,
         location of the error in the PostgreSQL source code
         (if log_error_verbosity is set to verbose),
-        application name, and backend type.
+        application name, backend type, and leader PID.
         Here is a sample table definition for storing CSV-format log output:
 
 CREATE TABLE postgres_log
@@ -7049,6 +7056,7 @@ CREATE TABLE postgres_log
   location text,
   application_name text,
   backend_type text,
+  leader_pid integer,
   PRIMARY KEY (session_id, session_line_num)
 );
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index e4b717c79a..3055b72c01 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -77,6 +77,7 @@
 #include "postmaster/syslogger.h"
 #include "storage/ipc.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "tcop/tcopprot.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -2448,6 +2449,25 @@ log_line_prefix(StringInfo buf, ErrorData *edata)
 				else
 					appendStringInfo(buf, "%d", MyProcPid);
 				break;
+
+			case 'P':
+				if (MyProc)
+				{
+					PGPROC	   *leader = MyProc->lockGroupLeader;
+
+					if (leader == NULL || leader->pid == MyProcPid)
+						/* padding only */
+						appendStringInfoSpaces(buf,
+											   padding > 0 ? padding : -padding);
+					else if (padding != 0)
+						appendStringInfo(buf, "%*d", padding, leader->pid);
+					else
+						appendStringInfo(buf, "%d", leader->pid);
+				}
+				else if (padding != 0)
+					appendStringInfoSpaces(buf,
+										   padding > 0 ? padding : -padding);
+				break;
+
 			case 'l':
 				if (padding != 0)
 					appendStringInfo(buf, "%*ld", padding, log_line_number);
@@ -2836,6 +2856,16 @@ write_csvlog(ErrorData *edata)
 	else
 		appendCSVLiteral(&buf, GetBackendTypeDesc(MyBackendType));
 
+	appendStringInfoChar(&buf, ',');
+
+	/* leader PID */
+	if (MyProc)
+	{
+		PGPROC	   *leader = MyProc->lockGroupLeader;
+
+		if (leader && leader->pid != MyProcPid)
+			appendStringInfo(&buf, "%d", leader->pid);
+	}
+
 	appendStringInfoChar(&buf, '\n');
 
 	/* If in the syslogger process, try to write messages direct to file */
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aa30291ea3..55c2e23c47 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -536,6 +536,7 @@
 	#   %h = remote host
 	#   %b = backend type
 	#   %p = process ID
+	#   %P = leader PID
 	#   %t = timestamp without milliseconds
 	#   %m = timestamp with milliseconds
 	#   %n = timestamp with milliseconds (as a Unix epoch)
-- 
2.17.0
Re: Default setting for enable_hashagg_disk
On Sat, Jul 25, 2020 at 05:13:00PM -0700, Peter Geoghegan wrote:
>On Sat, Jul 25, 2020 at 5:05 PM Tomas Vondra wrote:
>> I'm not sure what you mean by "reported memory usage doesn't reflect the
>> space used for transition state"? Surely it does include that, we've
>> built the memory accounting stuff pretty much exactly to do that.
>>
>> I think it's pretty clear what's happening - in the sorted case there's
>> only a single group getting new values at any moment, so when we decide
>> to spill we'll only add rows to that group and everything else will be
>> spilled to disk.
>
>Right.
>
>> In the unsorted case however we manage to initialize all groups in the
>> hash table, but at that point the groups are tiny and fit into work_mem.
>> As we process more and more data the groups grow, but we can't evict
>> them - at the moment we don't have that capability. So we end up
>> processing everything in memory, but significantly exceeding work_mem.
>
>work_mem was set to 200MB, which is more than the reported "Peak
>Memory Usage: 1605334kB". So either the random case significantly

That's 1.6GB, if I read it right. Which is more than 200MB ;-)

>exceeds work_mem and the "Peak Memory Usage" accounting is wrong
>(because it doesn't report this excess), or the random case really
>doesn't exceed work_mem but has a surprising advantage over the sorted
>case.
>
>--
>Peter Geoghegan

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: recovering from "found xmin ... from before relfrozenxid ..."
> On 24 Jul 2020, at 14:05, Ashutosh Sharma wrote:
>
> Attached is the patch that adds heap_force_kill(regclass, tid[]) and
> heap_force_freeze(regclass, tid[]) functions which Robert mentioned in the
> first email in this thread. The patch basically adds an extension named
> pg_surgery that contains these functions. Please have a look and let me
> know your feedback. Thank you.

Thanks for the patch! I have just a few random thoughts.

I think here we should report that we haven't done what was asked:

+       /* Nothing to do if the itemid is unused or already dead. */
+       if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+           continue;

Also, should we try to fix the VM along the way?

Are there any caveats with concurrent VACUUM? (I do not see any, just asking.)

It would be good to have some checks for interrupts in safe places.

I think we should not trust the user entirely here. I'd prefer validation and a graceful exit, not a core dump:

+       Assert(noffs <= PageGetMaxOffsetNumber(page));

For some reason we had unlogged versions of these functions. But I do not recall the exact rationale...

Also, I'd be happy if we had something like "Restore this tuple iff this does not break a unique constraint". To do so we would need to sort tids by xmin/xmax, to revive the most recent data first.

Thanks!

Best regards, Andrey Borodin.
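[Editor's note: on the interrupts point, something as simple as this inside the patch's per-TID loop would suffice (the loop shape is assumed, not copied from the patch):]

for (i = 0; i < ntids; i++)
{
	/* Let the user cancel a long-running surgery over a big TID array. */
	CHECK_FOR_INTERRUPTS();

	/* ... existing per-TID kill/freeze logic ... */
}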
Re: Making CASE error handling less surprising
On Fri, Jul 24, 2020 at 7:18 PM Tom Lane wrote:
> Robert Haas writes:
> > Like Pavel, and I think implicitly Dagfinn and Andres, I'm not sure I
> > believe this. Pavel's example is a good one. The leakproof exception
> > helps, but it doesn't cover everything. Users I've encountered throw
> > things like date_trunc() and lpad() into SQL code and expect them to
> > behave (from a performance point of view) like constants, but they
> > also expect 1/0 not to get evaluated too early when e.g. CASE is used.
> > It's difficult to meet both sets of expectations at the same time and
> > we're probably never going to have a perfect solution, but I think
> > you're minimizing the concern too much here.
>
> I've quoted this point before, but ... we can make queries arbitrarily
> fast, if we don't have to give the right answer. I think we've seen
> enough complaints on this topic now to make it clear that what we're
> doing today with CASE is the wrong answer.

So here's my concern in a little more detail.

For small databases, these performance concerns are not big deals. But on large, heavily loaded databases one tends to run into all of the pathological cases more frequently. In other words, the overhead for the largest users will likely not be proportional to the gains of the newer users who are surprised by the current behavior. The more complex we make exceptions to how the planner works, the more complex the knowledge required to work on the high end of the database becomes. Given that complexity, I just don't think this is worth it.

> The performance argument can be made to cut both ways, too. If somebody's
> got a very expensive function in a CASE arm that they don't expect to
> reach, having it be evaluated anyway because it's got constant inputs
> isn't going to make them happy.

However, in this case we would be evaluating the expensive CASE arm every time it is invoked (i.e., for every row matched), right? It is hard to see this as even being close to a performance gain, or even approximately neutral, because the cases where you get a significant gain are likely to be extremely rare, and the penalties when the cost applies will be many multiples of the maximum gain.

> The real bottom line is: if you don't want to do this, how else do
> you want to fix the problem? I'm no longer willing to deny that
> there is a problem.

I see three ways forward.

The first (probably the best) would be a solution along the lines of yours, along with a session-level GUC variable which could determine whether CASE branches can fold constants. This has several important benefits:
1. It gets a fix in shortly for those who want it.
2. It ensures this is optional behavior for the more experienced users (who can better decide which direction to go), and
3. It makes the behavior explicit, documented, and thus more easily understood.

A third approach would be to allow some sort of "constant evaluation mechanism", maybe with its own memory context, where constants could be cached on first evaluation under the statement memory context. That would solve the problem more generally.

> > I don't think I believe this either. I don't think an average user is
> > going to expect  to behave differently from (SELECT
> > ).
>
> Agreed, that's poorly (or not at all?) documented. But it's been
> true all along, and this patch isn't changing that behavior at all.
> I'm not sure if we should do anything more than improve the docs,
> but in any case it seems independent of the CASE issue.
> > The current behavior isn't great, but at least it handles these
> > cases consistently.
>
> Really?
>
> regards, tom lane

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
Re: Parallel bitmap index scan
On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar wrote:
>
> I would like to propose a patch for enabling parallelism for the bitmap
> index scan path.
>
> Background:
> Currently, we support only a parallel bitmap heap scan path. Therein, the
> underlying bitmap index scan is done by a single worker called the leader.
> The leader creates a bitmap in shared memory and once the bitmap is ready
> it creates a shared iterator; after that, all the workers process the
> shared iterator and scan the heap in parallel. While analyzing TPCH plans
> we have observed that some of the queries spend significant time in
> preparing the bitmap. So the idea of this patch is to use a parallel index
> scan for preparing the underlying bitmap in parallel.
>
> Design:
> If the underlying index AM supports the parallel path (currently only
> BTREE supports it), then we will create a parallel bitmap heap scan path
> on top of the parallel bitmap index scan path. The idea of this patch is
> that each worker will do a parallel index scan and generate its part of
> the bitmap. We will create a barrier so that we cannot start preparing
> the shared iterator until all the workers are ready with their bitmaps.
> The first worker that is ready with its bitmap will keep a copy of its TBM
> and the page table in shared memory, and all the subsequent workers will
> merge their TBM with the shared TBM. Once all the TBMs are merged we will
> get one common shared TBM, and after that stage the workers can continue.
> The remaining part is the same: one worker will scan the shared TBM and
> prepare the shared iterator, and once it is ready all the workers will
> jointly scan the heap in parallel using the shared iterator.
>
> BitmapHeapNext
> {
> ...
>     BarrierAttach();
>     tbm = MultiExecProcNode();
>     tbm_merge(tbm);                 -- merge with the common tbm using tbm_union
>     BarrierArriveAndWait();
>
>     if (BitmapShouldInitializeSharedState(pstate))  --> only one worker comes out of this
>     {
>         tbm_prepare_shared_iterate();
>         BitmapDoneInitializingSharedState();        --> wake up others
>     }
>     tbm_attach_shared_iterate();                    --> all workers attach to the shared iterator
> ...
> }
>
> Performance: With scale factor 10, I could see that Q6 spends significant
> time in a bitmap index scan, so I have measured performance with that
> query, and I can see that the bitmap index scan node is 3x faster by using
> 3 workers, whereas the overall plan got ~40% faster.
>
> TPCH: S.F.
> 10, work_mem=512MB, shared_buffers=1GB
>
> HEAD:
>
>  Limit  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.121..5260.122 rows=1 loops=1)
>    ->  Finalize Aggregate  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.119..5260.119 rows=1 loops=1)
>          ->  Gather  (cost=1559776.69..1559777.00 rows=3 width=32) (actual time=5257.251..5289.595 rows=4 loops=1)
>                Workers Planned: 3
>                Workers Launched: 3
>                ->  Partial Aggregate  (cost=1558776.69..1558776.70 rows=1 width=32) (actual time=5247.714..5247.714 rows=1 loops=4)
>                      ->  Parallel Bitmap Heap Scan on lineitem  (cost=300603.01..1556898.89 rows=375560 width=12) (actual time=3475.944..5037.484 rows=285808 loops=4)
>                            Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
>                            Heap Blocks: exact=205250
>                            ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=3169.855..3169.855 rows=1143234 loops=1)
>                                  Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
>  Planning Time: 0.659 ms
>  Execution Time: 5289.787 ms
> (13 rows)
>
> PATCH:
>
>  Limit  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=.572...572 rows=1 loops=1)
>    ->  Finalize Aggregate  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=.569...569 rows=1 loops=1)
>          ->  Gather  (cost=1559579.52..1559579.83 rows=3 width=32) (actual time=3328.619..3365.227 rows=4 loops=1)
>                Workers Planned: 3
>                Workers Launched: 3
>                ->  Partial Aggregate  (cost=1558579.52..1558579.53 rows=1 width=32) (actual time=3307.805..3307.805 rows=1 loops=4)
>                      ->  Parallel Bitmap Heap Scan on lineitem  (cost=300405.84..1556701.72 rows=375560 width=12) (actual time=1585.726..3097.628 rows=285808 loops=4)
>                            Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_di
Parallel bitmap index scan
I would like to propose a patch for enabling parallelism for the bitmap index scan path.

Background:
Currently, we support only a parallel bitmap heap scan path. Therein, the underlying bitmap index scan is done by a single worker, the leader. The leader creates a bitmap in shared memory, and once the bitmap is ready it creates a shared iterator; after that, all the workers process the shared iterator and scan the heap in parallel. While analyzing TPCH plans we have observed that some queries spend significant time preparing the bitmap. So the idea of this patch is to use a parallel index scan to prepare the underlying bitmap in parallel.

Design:
If the underlying index AM supports a parallel path (currently only BTREE supports it), we will create a parallel bitmap heap scan path on top of the parallel bitmap index scan path. Each worker will do a parallel index scan and generate its part of the bitmap, and we will create a barrier so that we cannot start preparing the shared iterator until all the workers are ready with their bitmaps. The first worker that is ready with its bitmap will keep a copy of its TBM and the page table in shared memory, and all subsequent workers will merge their TBMs with the shared TBM. Once all the TBMs are merged we will get one common shared TBM, and after that stage the workers can continue. The remaining part is the same as today: one worker will scan the shared TBM and prepare the shared iterator, and once it is ready all the workers will jointly scan the heap in parallel using the shared iterator. A sketch:

BitmapHeapNext
{
    ...
    BarrierAttach();
    tbm = MultiExecProcNode();
    tbm_merge(tbm);                    /* merge with the common TBM using tbm_union */
    BarrierArriveAndWait();

    if (BitmapShouldInitializeSharedState(pstate))   /* only one worker comes out of this */
    {
        tbm_prepare_shared_iterate();
        BitmapDoneInitializingSharedState();         /* wake up the others */
    }
    tbm_attach_shared_iterate();       /* all workers attach to the shared iterator */
    ...
}

Performance:
With scale factor 10, I could see that Q6 spends significant time in a bitmap index scan, so I have measured performance with that query. The bitmap index scan node is 3x faster using 3 workers, while the overall plan got ~40% faster.

TPCH: S.F. 10, work_mem=512MB, shared_buffers=1GB

HEAD:
Limit (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.121..5260.122 rows=1 loops=1)
  -> Finalize Aggregate (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.119..5260.119 rows=1 loops=1)
    -> Gather (cost=1559776.69..1559777.00 rows=3 width=32) (actual time=5257.251..5289.595 rows=4 loops=1)
      Workers Planned: 3
      Workers Launched: 3
      -> Partial Aggregate (cost=1558776.69..1558776.70 rows=1 width=32) (actual time=5247.714..5247.714 rows=1 loops=4)
        -> Parallel Bitmap Heap Scan on lineitem (cost=300603.01..1556898.89 rows=375560 width=12) (actual time=3475.944..5037.484 rows=285808 loops=4)
          Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
          Heap Blocks: exact=205250
          -> Bitmap Index Scan on idx_lineitem_shipdate (cost=0.00..300311.95 rows=1164235 width=0) (actual time=3169.855..3169.855 rows=1143234 loops=1)
            Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
Planning Time: 0.659 ms
Execution Time: 5289.787 ms
(13 rows)

PATCH:
Limit (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=.572...572 rows=1 loops=1)
  -> Finalize Aggregate (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=.569...569 rows=1 loops=1)
    -> Gather (cost=1559579.52..1559579.83 rows=3 width=32) (actual time=3328.619..3365.227 rows=4 loops=1)
      Workers Planned: 3
      Workers Launched: 3
      -> Partial Aggregate (cost=1558579.52..1558579.53 rows=1 width=32) (actual time=3307.805..3307.805 rows=1 loops=4)
        -> Parallel Bitmap Heap Scan on lineitem (cost=300405.84..1556701.72 rows=375560 width=12) (actual time=1585.726..3097.628 rows=285808 loops=4)
          Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
          Heap Blocks: exact=184293
          -> Parallel Bitmap Index Scan on idx_lineitem_shipdate (cost=0.00..300311.95 rows=1164235 width=0) (actual tim
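To make the merge step above concrete, here is a minimal C sketch of what a tbm_merge() helper could look like. The SharedBitmapMergeState layout and the tbm_merge() name are assumptions for illustration only; tbm_union() and the LWLock and Barrier APIs are real PostgreSQL functions, and note that the actual design copies the first worker's TBM and page table into shared memory rather than sharing a backend-local pointer as shown here.

    /*
     * Illustrative sketch only: SharedBitmapMergeState and tbm_merge() are
     * invented names, not the patch's actual code.  tbm_union() and
     * LWLockAcquire()/LWLockRelease() are real PostgreSQL APIs.  In a real
     * parallel worker the common TBM must live in shared memory (DSA);
     * storing a backend-local pointer would not work across processes.
     */
    #include "postgres.h"
    #include "nodes/tidbitmap.h"
    #include "storage/lwlock.h"

    typedef struct SharedBitmapMergeState
    {
        LWLock      lock;          /* protects shared_tbm */
        TIDBitmap  *shared_tbm;    /* the common bitmap being built */
    } SharedBitmapMergeState;

    static void
    tbm_merge(SharedBitmapMergeState *state, TIDBitmap *local_tbm)
    {
        LWLockAcquire(&state->lock, LW_EXCLUSIVE);
        if (state->shared_tbm == NULL)
            state->shared_tbm = local_tbm;             /* first worker seeds the common TBM */
        else
            tbm_union(state->shared_tbm, local_tbm);   /* OR in this worker's part */
        LWLockRelease(&state->lock);
    }

Each worker would call this between MultiExecProcNode() and BarrierArriveAndWait() in the BitmapHeapNext() sketch above.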
Re: INSERT INTO SELECT, Why Parallelism is not selected?
On Sat, Jul 25, 2020 at 8:42 PM Tom Lane wrote:
>
> Amit Kapila writes:
> > On Fri, Jul 24, 2020 at 7:36 PM Tom Lane wrote:
> >> Yeah, the proposed comment changes don't actually add much. Also
> >> please try to avoid inserting non-ASCII into the source code;
> >> at least in my mail reader, that attachment has some.
>
> > I don't see any non-ASCII characters in the patch. I have applied and
> > checked (via the vi editor) the patch as well, but I don't see any
> > non-ASCII characters. How can I verify that?
>
> They're definitely there:
>
> $ od -c 0001-Fix-comments-in-heapam.c.patch
> ...
> 0002740   h   e  \n   +  \t   *       l   e   a   d   e   r       c
> 0002760   a   n       p   e   r   f   o   r   m       t   h   e       i
> 0003000   n   s   e   r   t   . 302 240   T   h   i   s       r   e
> 0003020   s   t   r   i   c   t   i   o   n       c   a   n       b   e
> 0003040       u   p   l   i   f   t   e   d       o   n   c   e       w
> 0003060   e  \n   +  \t   *       a   l   l   o   w       t   h   e
> 0003100 302 240   p   l   a   n   n   e   r       t   o       g   e   n
> 0003120   e   r   a   t   e       p   a   r   a   l   l   e   l       p
> 0003140   l   a   n   s       f   o   r       i   n   s   e   r   t   s
> 0003160   .  \n  \t   *   /  \n  \t   i   f       (   I   s
> ...

Thanks, I could see that.

> I'm not sure if "git diff --check" would whine about this.

No, "git diff --check" doesn't help. I have tried pgindent as well, but that doesn't help either, nor was I expecting it to. I am still not able to figure out how I goofed this up, but I will spend some more time on it. In the meantime, I have updated the patch to improve the comments as suggested by Robert. Do let me know if you want to edit/add anything more.

--
With Regards,
Amit Kapila.

v2-0001-Fix-comments-in-heapam.c.patch Description: Binary data
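For anyone who wants to check a patch for this without reading raw od output, here is a small standalone C program (my illustration, not from the thread) that reports every byte outside the 7-bit ASCII range with its offset; the 302 240 pairs above would show up as 0xC2 0xA0, the UTF-8 encoding of a non-breaking space.

    /*
     * Standalone helper (illustration only): print the offset and value of
     * every byte outside the 7-bit ASCII range, so stray non-breaking
     * spaces (0xC2 0xA0, i.e. octal 302 240) are easy to locate in a file.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
        FILE   *fp;
        int     c;
        long    off = 0;
        int     found = 0;

        if (argc != 2)
        {
            fprintf(stderr, "usage: %s filename\n", argv[0]);
            return EXIT_FAILURE;
        }
        if ((fp = fopen(argv[1], "rb")) == NULL)
        {
            perror(argv[1]);
            return EXIT_FAILURE;
        }
        while ((c = fgetc(fp)) != EOF)
        {
            if (c > 0x7F)
            {
                printf("offset %ld: 0x%02X (octal %03o)\n", off, c, c);
                found = 1;
            }
            off++;
        }
        fclose(fp);
        return found ? EXIT_FAILURE : EXIT_SUCCESS;   /* nonzero exit if non-ASCII found */
    }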
Re: Difference for Binary format vs Text format for client-server communication
On Sun, Jul 26, 2020 at 1:49 AM Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-07-16 18:52, Andy Fan wrote:
> > The reason I ask this is that I have a task to make numeric output
> > similar to Oracle's.
> >
> > Oracle:
> >
> > SQL> select 2 / 1.0 from dual;
> >
> > 2/1.0
> > --
> > 2
> >
> > PG:
> >
> > postgres=# select 2 / 1.0;
> >  ?column?
> >
> >  2.
> > (1 row)
> >
> > If the user uses the text format, I can just hack the numeric_out
> > function, but if they use the binary format, it looks like I have to
> > change the driver they use. Do I understand that correctly?
>
> I think what you should be looking at is why the numeric division
> function produces that scale and possibly make changes there.

Thanks. I think you are talking about the select_div_scale function, which is called before the real division work in div_var, so it will be hard to hack at that point. Besides that, Oracle returns the zero-trimmed version whether or not division is involved (I forgot to mention that at first). In the end, I just hacked the numeric_out function, and it now works like Oracle. However, that only works for the text format. I tried JDBC, and it uses the text format by default. The solution is not good enough, but it is OK for my purposes currently.

IIUC, if a driver uses the text protocol for a data type, it works like this: 1) the server gets a value in binary format; 2) the server converts it to a string and sends it over the network; 3) the client receives the string; 4) the client converts the string to the given data type. That looks much more complex than the binary protocol, so why is the text protocol chosen by default?

> By the time the type's output or send function is invoked, that's already
> decided. The output/send functions are not the place to make scale or
> other semantic adjustments.
>
> --
> Peter Eisentraut http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Best Regards
Andy Fan
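For comparison, here is a minimal sketch of the kind of post-processing a hacked numeric_out() could apply to get Oracle-like output. This is my illustration of the idea, not the actual hack from this thread, and it assumes the plain decimal text form that numeric_out produces.

    /*
     * Illustrative sketch of trimming a numeric's text form ("2.0000" ->
     * "2", "2.5000" -> "2.5").  Not the actual hack from this thread.
     * Note this only affects the text format; binary-format clients never
     * see this string, which is exactly the limitation discussed above.
     */
    #include <string.h>

    static void
    trim_numeric_text(char *s)
    {
        char   *dot = strchr(s, '.');
        char   *end;

        if (dot == NULL)
            return;                 /* integer form, nothing to trim */

        end = s + strlen(s) - 1;
        while (end > dot && *end == '0')
            *end-- = '\0';          /* strip trailing zeroes */
        if (end == dot)
            *end = '\0';            /* strip a now-bare decimal point */
    }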
Re: Parallel Seq Scan vs kernel read ahead
On Wed, 15 Jul 2020 at 12:24, David Rowley wrote:
> If we've not seen any performance regressions within 1 week, then I
> propose that we (pending final review) push this to allow wider
> testing. It seems we're early enough in the PG14 cycle that there's a
> large window of time for us to do something about any reported
> performance regressions that come in.

I did that final review, which ended up in quite a few cosmetic changes. Functionality-wise, it's basically that of the v2 patch, with PARALLEL_SEQSCAN_MAX_CHUNK_SIZE set to 8192. I mentioned that we might want to revisit giving users some influence over the chunk size, but we'll only do so once we see conclusive evidence that it's worthwhile.

David
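For readers who haven't followed the thread: the idea, as I understand it, is that each parallel worker claims a contiguous chunk of blocks instead of one block at a time, so per-worker I/O stays sequential enough for the kernel's read-ahead to kick in, with the chunk size capped (at 8192 blocks here). A rough sketch with invented names follows; the real logic lives in PostgreSQL's table_block_parallelscan_nextpage(), uses atomics rather than plain arithmetic, and also ramps the chunk size down near the end of the scan so workers finish at roughly the same time.

    /*
     * Rough sketch of chunked block allocation for a parallel seqscan.
     * All names here are illustrative, not the committed code.
     */
    #define MAX_CHUNK_SIZE  8192

    typedef struct
    {
        long    next_block;     /* next block to hand out (atomic in reality) */
        long    nblocks;        /* total blocks in the relation */
    } ScanState;

    /* Claim the next chunk; returns its size, or 0 when the scan is done. */
    static long
    claim_chunk(ScanState *ss, long *first_block)
    {
        long    remaining = ss->nblocks - ss->next_block;
        long    chunk = MAX_CHUNK_SIZE;

        if (remaining <= 0)
            return 0;
        if (chunk > remaining)
            chunk = remaining;  /* never run past the end of the relation */

        *first_block = ss->next_block;
        ss->next_block += chunk;
        return chunk;
    }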
Re: expose parallel leader in CSV and log_line_prefix
On Thu, Jul 23, 2020 at 09:52:14AM +0900, Michael Paquier wrote:
> Sounds fine to me. Thanks.
>
> Do others have any objections to this wording?

I have used the wording suggested by Alvaro and applied the patch down to 13. Now let's see about the original item of this thread...
--
Michael
Re: OpenSSL randomness seeding
On Wed, Jul 22, 2020 at 11:31:38PM +0200, Daniel Gustafsson wrote:
> Thanks for picking it up!

For the archives, the patch set has been applied as ce4939f and 15e4419 on HEAD. Thanks, Noah.

> That's a good question. I believe that if one actually does use RAND_cleanup
> as a re-seeding mechanism, then that can break FIPS-enabled OpenSSL
> installations, as RAND_cleanup resets the RNG method from the FIPS RNG to
> the built-in one. I would be inclined to follow the upstream recommendation
> of using RAND_poll exclusively, but I'm far from an expert here.

RAND_cleanup() can cause a failure reporting that the RNG state is not initialized when attempting to use FIPS in 1.0.2. This is not officially supported by upstream AFAIK, and those APIs were dropped later in 1.1.0. And FWIW, VMware's Photon actually applies some custom patches in this area:
https://github.com/vmware/photon/tree/master/SPECS/openssl
openssl-drbg-default-read-system-fips.patch is used to enforce the initialization state of FIPS, for example. Anyway, I would just stick with the wiki recommendation.

>> Do you happen to know how OpenSSL 1.1.1 changed its reaction to forks in
>> order to make the RAND_poll() superfluous? (No need to research it if you
>> don't.)
>
> I'm not entirely sure, but I do believe that 1.1.1 ported over the RNG from
> the FIPS module, which re-seeds itself with fork() protection. There was,
> however, a bug, fixed in 1.1.1d or thereabouts as CVE-2019-1549, where the
> fork protection wasn't activated by default... so there is that. Since that
> bug was found, tests have been introduced to catch any regression there,
> which is comforting.

No idea about this one, actually.
--
Michael
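To make the wiki recommendation concrete, here is a minimal sketch of post-fork re-seeding for OpenSSL releases before 1.1.1 (my illustration, with error handling trimmed; RAND_poll() returns 1 on success).

    /*
     * Minimal sketch of the wiki-recommended post-fork re-seeding for
     * OpenSSL < 1.1.1.  Without the RAND_poll() call, parent and child
     * inherit the same PRNG state and can produce identical byte streams.
     * Build with -lcrypto.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <openssl/rand.h>

    int
    main(void)
    {
        unsigned char buf[16];
        pid_t   pid = fork();

        if (pid == 0)
        {
            /* child: re-seed from the OS before generating anything */
            if (RAND_poll() != 1)
            {
                fprintf(stderr, "RAND_poll failed\n");
                exit(EXIT_FAILURE);
            }
        }

        /* now safe in both parent and child */
        if (RAND_bytes(buf, sizeof(buf)) != 1)
            exit(EXIT_FAILURE);
        return EXIT_SUCCESS;
    }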