RE: Implement UNLOGGED clause for COPY FROM
Hi Amit-san,

> If you are going to suggest users not to replicate such tables then why
> can't you suggest them to create such tables as UNLOGGED in the first
> place? Another idea could be that you create an 'unlogged' table, copy
> the data to it. Then perform Alter Table .. SET Logged and attach it to
> the main table. I think for this you need the main table to be
> partitioned but I think if that is possible then it might be better than
> all the hacking you are proposing to do in the server for this special
> operation.

Thank you for your comment.

At the beginning, I should have mentioned that this feature is aimed at
data warehouses, where you need to load large amounts of data in the
shortest possible time. Sorry for my poor explanation.

Since a data warehouse generally cannot be separated from applications
such as B.I. tools, we cannot easily define the table as unlogged from the
beginning. As far as I know, such tools do not support defining unlogged
tables. If you wanted to do so, you would need to *modify an existing
application* that is implemented by a third party and sold commercially
for data analytics. In other words, to make CREATE UNLOGGED TABLE
available in that application, you would have to revise the product's
source code directly, which would void the warranty from the B.I. tool's
vendor. In my opinion, it is unrealistic to expect everyone to do that.

Best,
Takamichi Osumi
Re: OpenSSL randomness seeding
On Tue, Jul 21, 2020 at 10:00:20PM -0700, Noah Misch wrote:
> These look good.  I'll push them on Saturday or later.  I wondered
> whether to do both RAND_cleanup() and RAND_poll(), to purge all traces of
> the old seed on versions supporting both.  Since that would strictly
> (albeit negligibly) increase seed predictability, I like your decision
> here.

Thanks Noah for taking care of it. No plans for a backpatch, right?
--
Michael
Re: Parallel Seq Scan vs kernel read ahead
On Wed, 22 Jul 2020 at 16:40, k.jami...@fujitsu.com wrote:
> I used the default max_parallel_workers & max_worker_proceses which is 8
> by default in postgresql.conf. IOW, I ran all those tests with a maximum
> of 8 processes set. But my query planner capped both the Workers Planned
> and Launched at 6 for some reason when increasing the value for
> max_parallel_workers_per_gather.

max_parallel_workers_per_gather just imposes a limit on the planner as to
the maximum number of parallel workers it may choose for a given parallel
portion of a plan. The actual number of workers the planner decides is
best to use is based on the size of the relation. More pages = more
workers. It sounds like in this case the planner didn't think it was worth
using more than 6 workers.

The parallel_workers reloption, when not set to -1, overrides the
planner's decision on how many workers to use. It'll just always try to
use "parallel_workers".

> However, when I used the ALTER TABLE SET (parallel_workers = N) based on
> your suggestion, the query planner used that set value only for "Workers
> Planned", but not for "Workers Launched". The behavior of the query
> planner is also different when I also set the value of
> max_worker_processes and max_parallel_workers to parallel_workers + 1.

When it comes to execution, the executor is limited to however many
parallel worker processes are available to execute the plan. If all
workers happen to be busy with other tasks then it may find itself having
to process the entire query itself without any help from workers. Or there
may be workers available, just not as many as the planner picked to
execute the query.

The number of available workers is configured with "max_parallel_workers",
set in postgresql.conf. PostgreSQL won't complain if you try to set a
relation's parallel_workers reloption to a number higher than the
"max_parallel_workers" GUC. "max_parallel_workers" is further limited by
"max_worker_processes".

Likely you'll want to set both of those to at least 32 for this test, then
just adjust the relation's parallel_workers setting for each test.

David
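The sizing rule David describes (more pages = more workers, capped by max_parallel_workers_per_gather, overridden by the parallel_workers reloption) can be sketched as a toy model. This is an illustration only: the function name and the constants (1024 pages ~ 8MB of min_parallel_table_scan_size, one extra worker per threefold size increase) are assumptions loosely based on the planner's documented defaults, not PostgreSQL source code.

```python
def planned_workers(heap_pages, max_per_gather, reloption_workers=-1,
                    min_scan_pages=1024):  # ~8MB at 8kB pages (assumed default)
    """Toy model of how the planner picks a worker count for a seq scan."""
    if reloption_workers != -1:
        # ALTER TABLE ... SET (parallel_workers = N) bypasses the size heuristic
        workers = reloption_workers
    else:
        if heap_pages < min_scan_pages:
            return 0  # table considered too small for parallelism
        # one more worker each time the table size triples
        workers, threshold = 1, min_scan_pages
        while heap_pages >= threshold * 3:
            workers += 1
            threshold *= 3
    # max_parallel_workers_per_gather always caps the planner's choice
    return min(workers, max_per_gather)

# A relation sized for 6 workers stays at 6 even with a higher cap,
# matching the capping behavior described above:
print(planned_workers(1024 * 3**5, max_per_gather=8))                       # 6
print(planned_workers(1024 * 3**5, max_per_gather=8, reloption_workers=8))  # 8
```

Note that, as David says, this only models what is *planned*; how many workers actually launch depends on max_parallel_workers and max_worker_processes at execution time.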
Re: OpenSSL randomness seeding
On Tue, Jul 21, 2020 at 02:13:32PM +0200, Daniel Gustafsson wrote:
> The silver lining here is that while OpenSSL nooped RAND_cleanup, they
> also changed what is mixed into seeding so we are still not sharing a
> sequence. To fix this, changing the RAND_cleanup call to RAND_poll should
> be enough to ensure re-seeding after forking across all supported OpenSSL
> versions. Patch 0001 implements this along with a comment referencing
> when it can be removed (which most likely won't be for quite some time).
>
> Another thing that stood out when reviewing this code is that we optimize
> for RAND_poll failing in pg_strong_random, when we already have
> RAND_status checking for a sufficiently seeded RNG for us. ISTM that we
> can simplify the code by letting RAND_status do the work as per 0002, and
> also (while unlikely) survive any transient failures in RAND_poll by
> allowing all the retries we've defined for the loop.
>
> Thoughts on these?

These look good.  I'll push them on Saturday or later.  I wondered whether
to do both RAND_cleanup() and RAND_poll(), to purge all traces of the old
seed on versions supporting both.  Since that would strictly (albeit
negligibly) increase seed predictability, I like your decision here.

Do you happen to know how OpenSSL 1.1.1 changed its reaction to forks in
order to make the RAND_poll() superfluous?  (No need to research it if you
don't.)
Re: xl_heap_header alignment?
Tom Lane wrote:
> I don't particularly want to remove the field, but we ought to
> change or remove the comment.

I'm not concerned about the existence of the field either. The comment
just made me worried that I might be missing some fundamental concept.
Thanks for your opinion.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Re: Parallel Seq Scan vs kernel read ahead
On Wed, Jul 22, 2020 at 3:57 PM Amit Kapila wrote:
> Yeah, that is true but every time before the test the same amount of
> data should be present in shared buffers (or OS cache) if any which
> will help in getting consistent results. However, it is fine to
> reboot the machine as well if that is a convenient way.

We really should have an extension (pg_prewarm?) that knows how to kick
stuff out of PostgreSQL's shared buffers and the page cache
(POSIX_FADV_DONTNEED).
Re: Parallel Seq Scan vs kernel read ahead
On Wed, Jul 22, 2020 at 5:25 AM David Rowley wrote:
> I understand that Amit wrote:
> On Fri, 17 Jul 2020 at 21:18, Amit Kapila wrote:
>> I think recreating the database and restarting the server after each
>> run might help in getting consistent results. Also, you might want to
>> take median of three runs.
>
> Please also remember, if you're recreating the database after having
> restarted the machine that creating the table will likely end up
> caching some of the pages either in shared buffers or the Kernel's
> cache.

Yeah, that is true but every time before the test the same amount of data
should be present in shared buffers (or OS cache) if any which will help
in getting consistent results. However, it is fine to reboot the machine
as well if that is a convenient way.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar wrote:
> There was one warning in release mode in the last version in 0004 so
> attaching a new version.

Today, I was reviewing patch
v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
small problem with it.

+	/*
+	 * Execute the invalidations for xid-less transactions,
+	 * otherwise, accumulate them so that they can be processed at
+	 * the commit time.
+	 */
+	if (!ctx->fast_forward)
+	{
+		if (TransactionIdIsValid(xid))
+		{
+			ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+										  invals->nmsgs, invals->msgs);
+			ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+											  buf->origptr);
+		}

I think we need to set ReorderBufferXidSetCatalogChanges even when
ctx->fast_forward is true because we are dependent on that flag for
snapshot build (see SnapBuildCommitTxn). We are already doing the same way
in DecodeCommit where even though we skip adding invalidations for
fast-forward cases we do set the flag to indicate that this txn has
catalog changes. Is there any reason to do things differently here?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
RE: archive status ".ready" files may be created too early
Hello,

At Mon, 13 Jul 2020 01:57:36 +0000, "Kyotaro Horiguchi" wrote:
> Am I missing something here?

Let me explain in more detail (*). Record-A and Record-B are records that
cross a segment border. Record-A spans segments X and X+1; Record-B spans
segments X+2 and X+3. If both records have been inserted into the WAL
buffer, lastSegContRecStart/End points to Record-B.

(*) If a writer flushes segment X and part of X+1 but Record-A is not yet
completely flushed, NotifyStableSegments() allows the writer to notify
segment X. Record-A may then be invalidated by crash recovery and
overwritten by a new WAL record, so the segment X on disk is no longer the
same as the archived one.

Regards,
Ryo Matsumura
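The scenario above amounts to an invariant: segment X must not be marked .ready for archiving while a record that crosses X's end boundary is only partially flushed. A toy model of that rule (all names and the helper itself are hypothetical, for illustration only -- not the actual NotifyStableSegments() logic):

```python
SEG_SIZE = 16 * 1024 * 1024  # default 16MB WAL segment size

def may_notify_segment(seg, flushed_upto, cont_record_start, cont_record_end):
    """Return True if segment `seg` may safely be marked .ready.

    A segment whose final record continues into the next segment must not
    be notified until that whole record is durably flushed; otherwise crash
    recovery could rewrite the record's tail and the on-disk segment would
    diverge from the archived copy.
    """
    seg_end = (seg + 1) * SEG_SIZE
    crosses_boundary = cont_record_start < seg_end < cont_record_end
    if crosses_boundary:
        return flushed_upto >= cont_record_end
    return flushed_upto >= seg_end

# Record-A starts near the end of segment 0 and ends inside segment 1.
# Flushing segment 0 plus only part of Record-A is NOT enough to archive
# segment 0 -- that is exactly the bug described above:
X_END = SEG_SIZE
print(may_notify_segment(0, X_END + 100, X_END - 50, X_END + 200))  # False
print(may_notify_segment(0, X_END + 200, X_END - 50, X_END + 200))  # True
```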
Re: Stale external URL in doc?
At Sat, 18 Jul 2020 22:48:47 +0900, Michael Paquier wrote:
> On Fri, Jul 17, 2020 at 02:03:18PM +0900, Michael Paquier wrote:
>> It would be better to get all that fixed and backpatched. Is somebody
>> already looking into that?
>
> I have been through this set, and applied the changes as of 045d03f &
> friends. There was an extra URL broken in 9.5 and 9.6 related to the
> passphrase FAQ.

Thanks!

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: OpenSSL randomness seeding
On Tue, Jul 21, 2020 at 10:36:53PM +0200, Daniel Gustafsson wrote:
> I think the original intention was to handle older OpenSSL versions where
> multiple successful RAND_poll calls were required for RAND_status to
> succeed, the check working as an optimization since a failing RAND_poll
> would render all efforts useless anyway. I'm not sure this is true for
> the OpenSSL versions we support in HEAD, and/or for modern platforms, but
> without proof of it not being useful I would opt for keeping it.

Yeah, the retry loop refers to this part of the past discussion on the
matter:
https://www.postgresql.org/message-id/CAEZATCWYs6rAp36VKm4W7Sb3EF_7tNcRuhcnJC1P8=8w9nb...@mail.gmail.com

During the rewrite of the RNG engines, there was also retry logic
introduced in 75e2c87, then removed in c16de9d for 1.1.1. In short, we may
be able to live without that once we cut more support for OpenSSL versions
(minimum version support of 1.1.1 is a couple of years ahead at least for
us), but I see no reason to not leave that in place either. And this
visibly solved one problem for us. I don't see either a reason to not
simplify the loop to fall back to RAND_status() for the validation.

In short, the proposed patch set looks like a good idea to me to stick
with the recommendations of upstream's wiki to use RAND_poll() after a
fork, but only do that on HEAD (OpenSSL 1.1.0 mixes the current timestamp
and the PID into the random seed of the default engine, 1.0.2 only the
PID, but RAND_cleanup is a no-op only after 1.1.0).
--
Michael
Re: Comment referencing incorrect algorithm
On Tue, Jul 21, 2020 at 01:57:11PM +0200, Daniel Gustafsson wrote:
> While poking around our crypto code, I noticed that a comment in sha2.h
> was referencing sha-1 which is an algorithm not supported by the code.
> The attached fixes the comment aligning it with other comments in the
> file.

Thanks, fixed. The style of the surroundings is to not use a hyphen, so
fine by me to stick with that.
--
Michael
Re: v13 planner ERROR: could not determine which collation to use for string comparison
Michael Paquier writes:
> On Tue, Jul 21, 2020 at 06:25:00PM -0400, Tom Lane wrote:
>> Ugh.  It's clear from your stack trace that neqjoinsel() has forgotten
>> to pass through collation to eqjoinsel().  Will fix.
>
> Why didn't you include a regression test in bd0d893?

Didn't really see much point. It's not like anybody's likely to take out
the collation handling now that it's there.

regards, tom lane
Re: v13 planner ERROR: could not determine which collation to use for string comparison
On Tue, Jul 21, 2020 at 06:25:00PM -0400, Tom Lane wrote:
> Ugh.  It's clear from your stack trace that neqjoinsel() has forgotten to
> pass through collation to eqjoinsel().  Will fix.

Why didn't you include a regression test in bd0d893?
--
Michael
Re: Which SET TYPE don't actually require a rewrite
On Tue, Jul 21, 2020 at 04:55:37PM -0400, Bruce Momjian wrote:
> I know Tom put a wink on that, but I actually do feel that the only
> clean way to do this is to give users a way to issue the query in a
> non-executing way that will report if a rewrite is going to happen.

Yeah, when doing a schema upgrade for an application, that's the usual
performance pain point, and people used to things other than Postgres
write their queries without being aware of it. We have something able to
track that with the table_rewrite event trigger, but there is no easy
option to record the event while bypassing its execution. I think that
wrapping the ALTER TABLE query in a plpgsql function whose exception block
catches an error raised by a table_rewrite event trigger would allow doing
that, though.
--
Michael
Re: Parallel Seq Scan vs kernel read ahead
Hi Kirk,

Thank you for doing some testing on this. It's very useful to get some
samples from other hardware / filesystem / OS combinations.

On Tue, 21 Jul 2020 at 21:38, k.jami...@fujitsu.com wrote:
> Query Planner I/O Timings (ms):
> | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
> |--------|-------------------|------------------|--------------------|-------------------|
> | 0      | 1,130.78          | 1,250.82         | 1,698.05           | 1,733.44          |
>
> | Worker | Buffers                                          |
> |--------|--------------------------------------------------|
> | 0      | shared read=442478 dirtied=442478 written=442446 |

I'm thinking the scale of this test might be a bit too small for the
machine you're using to test. When you see "shared read" in the EXPLAIN
(ANALYZE, BUFFERS) output, it does not necessarily mean that the page had
to be read from disk. We use buffered I/O, so the page could just have
been fetched from the kernel's cache.

If we do some maths here on the timing: it took 1130.78 milliseconds to
read 442478 pages, which, assuming the standard page size of 8192 bytes,
is 3457 MB in 1130.78 milliseconds, or 3057 MB/sec. Is that a realistic
throughput for this machine in terms of I/O? Or do you think that some of
these pages might be coming from the kernel's cache?

I understand that Amit wrote:

On Fri, 17 Jul 2020 at 21:18, Amit Kapila wrote:
> I think recreating the database and restarting the server after each
> run might help in getting consistent results. Also, you might want to
> take median of three runs.

Please also remember, if you're recreating the database after having
restarted the machine, that creating the table will likely end up caching
some of the pages either in shared buffers or the kernel's cache. It would
be better to leave the database intact and just reboot the machine.

I didn't really like that option with my tests so I just increased the
size of the table beyond any size that my machines could have cached. With
the 16GB RAM Windows laptop, I used a 100GB table, and with the 64GB
workstation, I used an 800GB table.

I think my test using SELECT * FROM t WHERE a < 0; with a table that has a
padding column is likely going to be a more accurate test. Provided there
are no rows with a < 0 in the table, the executor will spend almost all of
the time in nodeSeqscan.c trying to find a row with a < 0. There's no
additional overhead of aggregation doing the count(*). Having the
additional padding column means that we read more data per evaluation of
the a < 0 expression. Also, having a single-column table is not that
realistic.

I'm pretty keen to see this machine running something closer to the test I
mentioned in [1] but with the benchmark query I mentioned in [2], with the
"t" table being at least twice the size of RAM in the machine. Larger
would be better though. With such a scaled test, I don't think there's
much need to reboot the machine in between. Just run a single query first
to warm up the cache before timing anything. Having the table a few times
larger than RAM will mean that we can be certain that the disk was
actually used during the test. The more data we can be certain came from
disk, the more we can trust that the results are meaningful.

Thanks again for testing this.

David

[1] https://www.postgresql.org/message-id/caaphdvrfjfyh51_wy-iqqpw8ygr4fdotxaqkqn+sa7ntkev...@mail.gmail.com
[2] https://www.postgresql.org/message-id/caaphdvo+legkmcavoipyk8nebgp-lrxns2tj1n_xnrjve9x...@mail.gmail.com
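David's back-of-envelope throughput check can be reproduced directly from the numbers reported in the thread:

```python
pages = 442_478      # "shared read" pages from EXPLAIN (ANALYZE, BUFFERS)
page_size = 8192     # standard PostgreSQL block size
read_ms = 1_130.78   # I/O read timing reported in the table above

mb = pages * page_size / (1024 * 1024)
mb_per_sec = mb / (read_ms / 1000)
print(f"{mb:.0f} MB in {read_ms} ms -> {mb_per_sec:.0f} MB/sec")
# -> about 3457 MB at about 3057 MB/sec, matching David's figures; a rate
# that high suggests much of the data came from the kernel's page cache
# rather than disk, which is the point of his question.
```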
Re: Index Skip Scan (new UniqueKeys)
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>> + currItem = &so->currPos.items[so->currPos.lastItem];
>>> + itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
>>> + nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
>
> Do you mean this last part with t_tid, which could also have a tid array
> in case of posting tuple format?

Yeah. There is a TID array at the end of the tuple when the tuple is a
posting list tuple (as indicated by BTreeTupleIsPosting()). It isn't safe
to assume that t_tid is a heap TID for this reason, even in code that only
ever considers data items (that is, non-high-key tuples AKA non-pivot
tuples) on a leaf page. (Though new/incoming tuples cannot be posting list
tuples either, so you'll see the assumption that t_tid is just a heap TID
in parts of nbtinsert.c -- though only for the new/incoming item.)

> Well, it's obviously wrong, thanks for noticing. What is necessary is to
> compare two index tuples, the start and the next one, to test if they're
> the same (in which case if I'm not mistaken probably we can compare item
> pointers). I've got this question when I was about to post a new version
> with changes to address feedback from Andy, now I'll combine them and
> send a cumulative patch.

This sounds like approximately the same problem as the one that
_bt_killitems() has to deal with as of Postgres 13. This is handled in a
way that is admittedly pretty tricky, even though the code does not need
to be 100% certain that it's "the same" tuple. Deduplication kind of makes
that a fuzzy concept. In principle there could be one big index tuple
instead of 5 tuples, even though the logical contents of the page have not
been changed between the time we recorded heap TIDs in local memory and
the time _bt_killitems() tried to match on those heap TIDs to
kill_prior_tuple-kill some index tuples -- a concurrent deduplication pass
could do that. Your code needs to be prepared for stuff like that.
Ultimately posting list tuples are just a matter of understanding the
on-disk representation -- a "Small Matter of Programming".

Even without deduplication there are potential hazards from the physical
deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is not
code that runs in VACUUM, despite the name). Make sure that you hold a
buffer pin on the leaf page throughout, because you need to do that to
make sure that VACUUM cannot concurrently recycle heap TIDs. If VACUUM
*is* able to concurrently recycle heap TIDs then it'll be subtly broken.
_bt_killitems() is safe because it either holds on to a pin or gives up
when the LSN changes at all. (ISTM that your only choice is to hold on to
a leaf page pin, since you cannot just decide to give up in the way that
_bt_killitems() sometimes can.)

Note that the rules surrounding buffer locks/pins for nbtree were
tightened up a tiny bit today -- see commit 4a70f829. Also, it's no longer
okay to use raw LockBuffer() calls in nbtree, so you're going to have to
fix that up when you next rebase -- you must use the new _bt_lockbuf()
wrapper function instead, so that the new Valgrind instrumentation is
used. This shouldn't be hard. Perhaps you can use Valgrind to verify that
this patch doesn't have any unsafe buffer accesses. I recall problems like
that in earlier versions of the patch series. Valgrind should be able to
detect most bugs like that (though see comments within _bt_lockbuf() for
details of a bug in this area that Valgrind cannot detect even now).

--
Peter Geoghegan
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
On Fri, Jul 17, 2020 at 5:53 PM Peter Geoghegan wrote:
> Pushed the first patch just now, and intend to push the other one soon.
> Thanks!

Pushed the second piece of this (the nbtree patch) just now. Thanks for
the review!

--
Peter Geoghegan
Re: v13 planner ERROR: could not determine which collation to use for string comparison
Justin Pryzby writes:
> We hit this on v13b2 and verified it fails on today's HEAD (ac25e7b039).
> explain SELECT 1 FROM sites NATURAL JOIN sectors WHERE sites.config_site_name != sectors.sect_name ;
> ERROR: could not determine which collation to use for string comparison
> I can workaround the issue by DELETEing stats for either column.

Ugh.  It's clear from your stack trace that neqjoinsel() has forgotten to
pass through collation to eqjoinsel().  Will fix.

regards, tom lane
Re: Infinities in type numeric
I wrote:
> Dean Rasheed writes:
>> I had a look at this, and I think it's mostly in good shape. It looks
>> like everything from the first message in this thread has been
>> resolved, except I don't know about the jsonpath stuff, because I
>> haven't been following that.
> Thanks for the careful review!

Here's a v4 that syncs numeric in_range() with the new behavior of float
in_range(), and addresses your other comments too.

regards, tom lane

diff --git a/contrib/jsonb_plperl/jsonb_plperl.c b/contrib/jsonb_plperl/jsonb_plperl.c
index ed361efbe2..b81ba54b80 100644
--- a/contrib/jsonb_plperl/jsonb_plperl.c
+++ b/contrib/jsonb_plperl/jsonb_plperl.c
@@ -227,10 +227,8 @@ SV_to_JsonbValue(SV *in, JsonbParseState **jsonb_state, bool is_elem)
 			/*
 			 * jsonb doesn't allow infinity or NaN (per JSON
 			 * specification), but the numeric type that is used for the
-			 * storage accepts NaN, so we have to prevent it here
-			 * explicitly.  We don't really have to check for isinf()
-			 * here, as numeric doesn't allow it and it would be caught
-			 * later, but it makes for a nicer error message.
+			 * storage accepts those, so we have to reject them here
+			 * explicitly.
 			 */
 			if (isinf(nval))
 				ereport(ERROR,
diff --git a/contrib/jsonb_plpython/jsonb_plpython.c b/contrib/jsonb_plpython/jsonb_plpython.c
index e09308daf0..836c178770 100644
--- a/contrib/jsonb_plpython/jsonb_plpython.c
+++ b/contrib/jsonb_plpython/jsonb_plpython.c
@@ -387,14 +387,17 @@ PLyNumber_ToJsonbValue(PyObject *obj, JsonbValue *jbvNum)
 	pfree(str);
 
 	/*
-	 * jsonb doesn't allow NaN (per JSON specification), so we have to prevent
-	 * it here explicitly.  (Infinity is also not allowed in jsonb, but
-	 * numeric_in above already catches that.)
+	 * jsonb doesn't allow NaN or infinity (per JSON specification), so we
+	 * have to reject those here explicitly.
 	 */
 	if (numeric_is_nan(num))
 		ereport(ERROR,
 				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
 				 errmsg("cannot convert NaN to jsonb")));
+	if (numeric_is_inf(num))
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("cannot convert infinity to jsonb")));
 
 	jbvNum->type = jbvNumeric;
 	jbvNum->val.numeric = num;
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 7027758d28..50e370cae4 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -554,9 +554,9 @@ NUMERIC(precision)
     NUMERIC
-    without any precision or scale creates a column in which numeric
-    values of any precision and scale can be stored, up to the
-    implementation limit on precision.  A column of this kind will
+    without any precision or scale creates an unconstrained
+    numeric column in which numeric values of any length can be
+    stored, up to the implementation limits.  A column of this kind will
     not coerce input values to any particular scale, whereas
     numeric columns with a declared scale will coerce
     input values to that scale.  (The SQL standard
@@ -568,10 +568,10 @@ NUMERIC
-    The maximum allowed precision when explicitly specified in the
-    type declaration is 1000; NUMERIC without a specified
-    precision is subject to the limits described in .
+    The maximum precision that can be explicitly specified in
+    a NUMERIC type declaration is 1000.  An
+    unconstrained NUMERIC column is subject to the limits
+    described in .
@@ -593,6 +593,11 @@ NUMERIC
     plus three to eight bytes overhead.
+
+    infinity
+    numeric (data type)
+
+
     NaN
     not a number
@@ -604,13 +609,44 @@ NUMERIC
-    In addition to ordinary numeric values, the numeric
-    type allows the special value NaN, meaning
-    not-a-number.  Any operation on NaN
-    yields another NaN.  When writing this value
-    as a constant in an SQL command, you must put quotes around it,
-    for example UPDATE table SET x = 'NaN'.  On input,
-    the string NaN is recognized in a case-insensitive manner.
+    In addition to ordinary numeric values, the numeric type
+    has several special values:
+    Infinity
+    -Infinity
+    NaN
+    These are adapted from the IEEE 754 standard, and represent
+    infinity, negative infinity, and
+    not-a-number, respectively.  When writing these values
+    as constants in an SQL command, you must put quotes around them,
+    for example UPDATE table SET x = '-Infinity'.
+    On input, these strings are recognized in a case-insensitive manner.
+    The infinity values can alternatively be spelled inf
+    and -inf.
+
+    The infinity values behave as per mathematical expectations.  For
+    example, Infinity plus any finite value equals
+    Infinity, as does Infinity
+    plus Infinity; but Infinity
+    minus Infinity yields NaN (not a
+    number), because it has no well-defined interpretation.  Note
Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)
On Mon, Jul 6, 2020 at 1:35 AM Georgios Kokolatos wrote:
> As a general overview, the series of patches in the mail thread do match
> their description. The addition of the stricter, explicit use of
> instrumentation does improve the design as the distinction of the use
> cases requiring a pin or a lock is made more clear. The added commentary
> is descriptive and appears grammatically correct, at least to a non
> native speaker.

I didn't see this review until now because it ended up in gmail's spam
folder. :-( Thanks for taking a look at it!

--
Peter Geoghegan
Re: Which SET TYPE don't actually require a rewrite
On Fri, Jul 17, 2020 at 11:26:56AM -0400, Tom Lane wrote:
> Magnus Hagander writes:
>> As Amit mentions it is also triggered by some store parameter changes.
>> But not all. So looking at it the other way, the part that the end user
>> really cares about is "which ALTER TABLE operations will rewrite the
>> table and which will not". Maybe what we need is a section specifically
>> on this that summarizes all the different ways that it can happen.
>
> No, what we need is EXPLAIN for DDL ;-).  Trying to keep such
> documentation in sync with the actual code behavior would be impossible.
> (For one thing, some aspects can be affected by extension datatype
> behaviors.)

I know Tom put a wink on that, but I actually do feel that the only clean
way to do this is to give users a way to issue the query in a
non-executing way that will report if a rewrite is going to happen.

--
  Bruce Momjian  https://momjian.us
  EnterpriseDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
Re: Improving psql slash usage help message
On 2020-07-21 17:10, Tom Lane wrote:
> Hamid Akhtar writes:
>> So are you suggesting to not fix this or do a more detailed review and
>> assess what other psql messages can be grouped together.
>
> I was just imagining merging the entries for the commands that are
> implemented by listTables().  If you see something else that would be
> worth doing, feel free to suggest it, but that wasn't what I was
> thinking of.

It used to be like that, but it was changed here:
9491c82f7103d62824d3132b8c7dc44609f2f56b

I'm not sure which way is better. Right now, a question like "how do I
list all indexes" is easily answered by the help output. Under the other
scheme, it's hidden behind a layer of metasyntax.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: OpenSSL randomness seeding
> On 21 Jul 2020, at 22:00, David Steele wrote:
>
> On 7/21/20 3:44 PM, Daniel Gustafsson wrote:
>>> On 21 Jul 2020, at 17:31, David Steele wrote:
>>> On 7/21/20 8:13 AM, Daniel Gustafsson wrote:
>>>> Another thing that stood out when reviewing this code is that we
>>>> optimize for RAND_poll failing in pg_strong_random, when we already
>>>> have RAND_status checking for a sufficiently seeded RNG for us. ISTM
>>>> that we can simplify the code by letting RAND_status do the work as
>>>> per 0002, and also (while unlikely) survive any transient failures in
>>>> RAND_poll by allowing all the retries we've defined for the loop.
>>>
>>> I wonder how effective the retries are going to be if they happen
>>> immediately. However, most of the code paths I followed ended in a
>>> hard error when pg_strong_random() failed so it may not hurt to try. I
>>> just worry that some caller is depending on a faster failure here.
>>
>> There is that, but I'm not convinced that relying on specific timing
>> for anything RNG or similarly cryptographic-related is especially sane.
>
> I wasn't thinking specific timing -- just that the caller might be
> expecting it to give up quickly if it doesn't work. That's what the code
> is trying to do and I wonder if there is a reason for it.

I think the original intention was to handle older OpenSSL versions where
multiple successful RAND_poll calls were required for RAND_status to
succeed, the check working as an optimization since a failing RAND_poll
would render all efforts useless anyway. I'm not sure this is true for the
OpenSSL versions we support in HEAD, and/or for modern platforms, but
without proof of it not being useful I would opt for keeping it.

cheers ./daniel
Re: Default setting for enable_hashagg_disk
On Tue, Jul 14, 2020 at 03:49:40PM -0700, Peter Geoghegan wrote:
> Maybe I missed your point here. The problem is not so much that we'll
> get HashAggs that spill -- there is nothing intrinsically wrong with
> that. While it's true that the I/O pattern is not as sequential as a
> similar group agg + sort, that doesn't seem like the really important
> factor here. The really important factor is that in-memory HashAggs
> can be blazingly fast relative to *any* alternative strategy -- be it
> a HashAgg that spills, or a group aggregate + sort that doesn't spill,
> whatever. We're mostly concerned about keeping the one available fast
> strategy than we are about getting a new, generally slow strategy.

Do we have any data that in-memory HashAggs are "blazingly fast relative
to *any* alternative strategy"? The data I have tested myself and what I
saw from Tomas was that spilling sort or spilling hash are both 2.5x
slower. Are we sure the quoted statement is true?

--
  Bruce Momjian  https://momjian.us
  EnterpriseDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
Re: OpenSSL randomness seeding
On 7/21/20 3:44 PM, Daniel Gustafsson wrote:
>> On 21 Jul 2020, at 17:31, David Steele wrote:
>> On 7/21/20 8:13 AM, Daniel Gustafsson wrote:
>>> Another thing that stood out when reviewing this code is that we
>>> optimize for RAND_poll failing in pg_strong_random, when we already
>>> have RAND_status checking for a sufficiently seeded RNG for us. ISTM
>>> that we can simplify the code by letting RAND_status do the work as
>>> per 0002, and also (while unlikely) survive any transient failures in
>>> RAND_poll by allowing all the retries we've defined for the loop.
>>
>> I wonder how effective the retries are going to be if they happen
>> immediately. However, most of the code paths I followed ended in a hard
>> error when pg_strong_random() failed so it may not hurt to try. I just
>> worry that some caller is depending on a faster failure here.
>
> There is that, but I'm not convinced that relying on specific timing for
> anything RNG or similarly cryptographic-related is especially sane.

I wasn't thinking specific timing -- just that the caller might be
expecting it to give up quickly if it doesn't work. That's what the code
is trying to do and I wonder if there is a reason for it.

But you are probably correct and I'm just overthinking it.

Regards,
--
-David
da...@pgmasters.net
Re: v13 planner ERROR: could not determine which collation to use for string comparison
Reproducer: postgres=# CREATE TABLE t AS SELECT '' a FROM generate_series(1,99); CREATE TABLE u AS SELECT '' a FROM generate_series(1,99); VACUUM ANALYZE t,u; postgres=# explain SELECT * FROM t JOIN u ON t.a!=u.a; ERROR: could not determine which collation to use for string comparison HINT: Use the COLLATE clause to set the collation explicitly.
Re: OpenSSL randomness seeding
> On 21 Jul 2020, at 17:31, David Steele wrote: > On 7/21/20 8:13 AM, Daniel Gustafsson wrote: >> Another thing that stood out when reviewing this code is that we optimize for >> RAND_poll failing in pg_strong_random, when we already have RAND_status >> checking for a sufficiently seeded RNG for us. ISTM that we can simplify the >> code by letting RAND_status do the work as per 0002, and also (while >> unlikely) >> survive any transient failures in RAND_poll by allowing all the retries we've >> defined for the loop. > > I wonder how effective the retries are going to be if they happen > immediately. However, most of the code paths I followed ended in a hard error > when pg_strong_random() failed so it may not hurt to try. I just worry that > some caller is depending on a faster failure here. There is that, but I'm not convinced that relying on specific timing for anything RNG or similarly cryptographic-related is especially sane. cheers ./daniel
Re: v13 planner ERROR: could not determine which collation to use for string comparison
On Tuesday, July 21, 2020, Justin Pryzby wrote: > We hit this on v13b2 and verified it fails on today's HEAD (ac25e7b039). > > explain SELECT 1 FROM sites NATURAL JOIN sectors WHERE > sites.config_site_name != sectors.sect_name ; > ERROR: could not determine which collation to use for string comparison > > I can workaround the issue by DELETEing stats for either column. > > It's possible we're doing something wrong and I need to revisit docs..but > this was working in v12. > This sounds suspiciously like a side-effect of: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=022cd0bfd33968f2b004106cfeaa3b2951e7f322 David J.
v13 planner ERROR: could not determine which collation to use for string comparison
We hit this on v13b2 and verified it fails on today's HEAD (ac25e7b039). explain SELECT 1 FROM sites NATURAL JOIN sectors WHERE sites.config_site_name != sectors.sect_name ; ERROR: could not determine which collation to use for string comparison I can workaround the issue by DELETEing stats for either column. It's possible we're doing something wrong and I need to revisit docs..but this was working in v12. ts=# SELECT * FROM pg_stats WHERE tablename='sites' AND attname='config_site_name'; -[ RECORD 1 ]--+- schemaname | public tablename | sites attname | config_site_name inherited | f null_frac | 0 avg_width | 1 n_distinct | 1 most_common_vals | {""} most_common_freqs | {1} histogram_bounds | correlation | 1 most_common_elems | most_common_elem_freqs | elem_count_histogram | #1 0x00ab2993 in errfinish (filename=0xcaae40 "varlena.c", lineno=1476, funcname=0xcab7b0 <__func__.18296> "check_collation_set") at elog.c:502 #2 0x00a783ae in check_collation_set (collid=0) at varlena.c:1473 #3 0x00a78857 in texteq (fcinfo=0x7fff1ecae590) at varlena.c:1740 #4 0x00a4248c in eqjoinsel_inner (opfuncoid=67, collation=0, vardata1=0x7fff1ecae7a0, vardata2=0x7fff1ecae770, nd1=1, nd2=1, isdefault1=false, isdefault2=false, sslot1=0x7fff1ecae720, sslot2=0x7fff1ecae6e0, stats1=0x1a97c00, stats2=0x1a98230, have_mcvs1=true, have_mcvs2=true) at selfuncs.c:2466 #5 0x00a41f66 in eqjoinsel (fcinfo=0x7fff1ecae8a0) at selfuncs.c:2298 #6 0x00abb63c in DirectFunctionCall5Coll (func=0xa41caf , collation=0, arg1=28313248, arg2=98, arg3=28315832, arg4=0, arg5=140733710004032) at fmgr.c:908 #7 0x00a43197 in neqjoinsel (fcinfo=0x7fff1ecaea40) at selfuncs.c:2824 #8 0x00abc4a0 in FunctionCall5Coll (flinfo=0x7fff1ecaeb00, collation=100, arg1=28313248, arg2=531, arg3=28315832, arg4=0, arg5=140733710004032) at fmgr.c:1245 #9 0x00abcd1c in OidFunctionCall5Coll (functionId=106, collation=100, arg1=28313248, arg2=531, arg3=28315832, arg4=0, arg5=140733710004032) at fmgr.c:1463 #10 0x0084b2c2 in join_selectivity
(root=0x1b006a0, operatorid=531, args=0x1b010b8, inputcollid=100, jointype=JOIN_INNER, sjinfo=0x7fff1ecaef40) at plancat.c:1822 #11 0x007dba29 in clause_selectivity (root=0x1b006a0, clause=0x1b01168, varRelid=0, jointype=JOIN_INNER, sjinfo=0x7fff1ecaef40) at clausesel.c:765 #12 0x007dacf4 in clauselist_selectivity_simple (root=0x1b006a0, clauses=0x1b05fe8, varRelid=0, jointype=JOIN_INNER, sjinfo=0x7fff1ecaef40, estimatedclauses=0x0) at clausesel.c:169 #13 0x007dac33 in clauselist_selectivity (root=0x1b006a0, clauses=0x1b05fe8, varRelid=0, jointype=JOIN_INNER, sjinfo=0x7fff1ecaef40) at clausesel.c:102 #14 0x007e44e3 in calc_joinrel_size_estimate (root=0x1b006a0, joinrel=0x1b02ce0, outer_rel=0x1afd4f0, inner_rel=0x1b01cf0, outer_rows=311, inner_rows=1047, sjinfo=0x7fff1ecaef40, restrictlist_in=0x1b05de0) at costsize.c:4857 #15 0x007e41eb in set_joinrel_size_estimates (root=0x1b006a0, rel=0x1b02ce0, outer_rel=0x1afd4f0, inner_rel=0x1b01cf0, sjinfo=0x7fff1ecaef40, restrictlist=0x1b05de0) at costsize.c:4712 #16 0x008507a6 in build_join_rel (root=0x1b006a0, joinrelids=0x1b05c08, outer_rel=0x1afd4f0, inner_rel=0x1b01cf0, sjinfo=0x7fff1ecaef40, restrictlist_ptr=0x7fff1ecaef38) at relnode.c:728 #17 0x007f5ecb in make_join_rel (root=0x1b006a0, rel1=0x1afd4f0, rel2=0x1b01cf0) at joinrels.c:746 #18 0x007f542e in make_rels_by_clause_joins (root=0x1b006a0, old_rel=0x1afd4f0, other_rels_list=0x1b05d08, other_rels=0x1b05d28) at joinrels.c:312 #19 0x007f4f04 in join_search_one_level (root=0x1b006a0, level=2) at joinrels.c:123 #20 0x007d96a5 in standard_join_search (root=0x1b006a0, levels_needed=2, initial_rels=0x1b05d08) at allpaths.c:3097 #21 0x007d961e in make_rel_from_joinlist (root=0x1b006a0, joinlist=0x1b03b28) at allpaths.c:3028 #22 0x007d4f82 in make_one_rel (root=0x1b006a0, joinlist=0x1b03b28) at allpaths.c:227 #23 0x0080f835 in query_planner (root=0x1b006a0, qp_callback=0x816525 , qp_extra=0x7fff1ecaf320) at planmain.c:269 #24 0x00813406 in grouping_planner 
(root=0x1b006a0, inheritance_update=false, tuple_fraction=0) at planner.c:2058 #25 0x008115b7 in subquery_planner (glob=0x1b00588, parse=0x1afdc48, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:1015 #26 0x0080fe34 in standard_planner (parse=0x1afdc48, query_string=0x1938e90 "explain SELECT 1 FROM sites NATURAL JOIN sectors WHERE sites.config_site_name != sectors.sect_name ;", cursorOptions=256, boundParams=0x0) at planner.c:405
Re: xl_heap_header alignment?
Andres Freund writes: > On July 21, 2020 10:45:37 AM PDT, Antonin Houska wrote: >> I don't quite understand this part of the comment of the xl_heap_header >> structure: >> * NOTE: t_hoff could be recomputed, but we may as well store it because >> * it will come for free due to alignment considerations. > Unless you declare them as packed, structs will add padding to align members > correctly (if, and only if, the whole struct is stored well aligned). I think that comment may be out of date, because what's there now is * NOTE: t_hoff could be recomputed, but we may as well store it because * it will come for free due to alignment considerations. */ typedef struct xl_heap_header { uint16 t_infomask2; uint16 t_infomask; uint8 t_hoff; } xl_heap_header; I find it hard to see how tacking t_hoff onto what would have been a 4-byte struct is "free". Maybe sometime in the dim past there was another field in this struct? (But I checked back as far as 7.4 without finding one.) I don't particularly want to remove the field, but we ought to change or remove the comment. regards, tom lane
Re: xl_heap_header alignment?
Hi, On July 21, 2020 10:45:37 AM PDT, Antonin Houska wrote: >I don't quite understand this part of the comment of the xl_heap_header >structure: > >* NOTE: t_hoff could be recomputed, but we may as well store it because > * it will come for free due to alignment considerations. > >What are the alignment considerations? The WAL code does not appear to >assume >any alignment, and therefore it uses memcpy() to copy the structure >into a >local variable before accessing its fields. For example, >heap_xlog_insert(). Unless you declare them as packed, structs will add padding to align members correctly (if, and only if, the whole struct is stored well aligned). Regards, Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
xl_heap_header alignment?
I don't quite understand this part of the comment of the xl_heap_header structure: * NOTE: t_hoff could be recomputed, but we may as well store it because * it will come for free due to alignment considerations. What are the alignment considerations? The WAL code does not appear to assume any alignment, and therefore it uses memcpy() to copy the structure into a local variable before accessing its fields. For example, heap_xlog_insert(). -- Antonin Houska Web: https://www.cybertec-postgresql.com
RE: Parallel Seq Scan vs kernel read ahead
On Friday, July 17, 2020 6:18 PM (GMT+9), Amit Kapila wrote: > On Fri, Jul 17, 2020 at 11:35 AM k.jami...@fujitsu.com > wrote: > > > > On Wednesday, July 15, 2020 12:52 PM (GMT+9), David Rowley wrote: > > > > >On Wed, 15 Jul 2020 at 14:51, Amit Kapila wrote: > > >> > > >> On Wed, Jul 15, 2020 at 5:55 AM David Rowley > wrote: > > >>> If we've not seen any performance regressions within 1 week, then > > >>> I propose that we (pending final review) push this to allow wider > > >>> testing. > > >> > > >> I think Soumyadeep has reported a regression case [1] with the > > >> earlier version of the patch. I am not sure if we have verified > > >> that the situation improves with the latest version of the patch. > > >> I request Soumyadeep to please try once with the latest patch. > > >... > > >Yeah, it would be good to see some more data points on that test. > > >Jumping from 2 up to 6 workers just leaves us to guess where the > > >performance started to become bad. >It would be good to know if the > > >regression is repeatable or if it was affected by some other process. > > >... > > >It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET > > >track_io_timing = on; for each value of >max_parallel_workers. > > > > Hi, > > > > If I'm following the thread correctly, we may have gains on this patch > > of Thomas and David, but we need to test its effects on different > > filesystems. It's also been clarified by David through benchmark tests > > that synchronize_seqscans is not affected as long as the set cap per > > chunk size of parallel scan is at 8192. > > > > I also agree that having a control on this through GUC can be > > beneficial for users, however, that can be discussed in another thread > > or development in the future. > > > > David Rowley wrote: > > >I'd like to propose that if anyone wants to do further testing on > > >other operating systems with SSDs or HDDs then it would be good if > > >that could be done within a 1 week from this email. 
There are various > > >benchmarking ideas on this thread for inspiration. > > > > I'd like to join on testing it, this one using HDD, and at the bottom > > are the results. Due to my machine limitations, I only tested > > 0~6 workers, that even if I increase max_parallel_workers_per_gather > > more than that, the query planner would still cap the workers at 6. > > I also set the track_io_timing to on as per David's recommendation. > > > > Tested on: > > XFS filesystem, HDD virtual machine > > RHEL4, 64-bit, > > 4 CPUs, Intel Core Processor (Haswell, IBRS) PostgreSQL 14devel on > > x86_64-pc-linux-gnu > > > > > > Test Case (Soumyadeep's) [1] > > > > shared_buffers = 32MB (to use OS page cache) > > > > create table t_heap as select generate_series(1, 1) i; --about > 3.4GB size > > > > SET track_io_timing = on; > > > > \timing > > > > set max_parallel_workers_per_gather = 0; --0 to 6 > > > > SELECT count(*) from t_heap; > > EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap; > > > > [Summary] > > I used the same query from the thread. However, the sql query > > execution time and query planner execution time results between the > > master and patched do not vary much. > > OTOH, in terms of I/O stats, I observed similar regression in both > > master and patched as we increase max_parallel_workers_per_gather. > > > > It could also be possible that each benchmark setting for > > max_parallel_workers_per_gather is affected by previous result . IOW, > > later benchmark runs benefit from the data cached by previous runs on OS > > level. > > > > Yeah, I think to some extent that is visible in results because, after patch, > at 0 > workers, the execution time is reduced significantly whereas there is not much > difference at other worker counts. I think for non-parallel case (0 workers), > there shouldn't be any difference. > Also, I am not sure if there is any reason why after patch the number of > shared hits > is improved, probably due to caching effects? 
> > > Any advice? > > I think recreating the database and restarting the server after each run > might help in getting consistent results. Also, you might want to take the > median of three runs. Thank you for the advice. I repeated the test as you suggested, taking the average of 3 runs per planned number of workers. It still shows the following similar performance results between Master and Patch V2. I wonder why there's no difference though. The test on my machine is roughly like this:

createdb test
psql -d test
create table t_heap as select generate_series(1, 1) i;
\q
pg_ctl restart
psql -d test
SET track_io_timing = on;
SET max_parallel_workers_per_gather = 0;
SHOW max_parallel_workers_per_gather;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
\timing
SELECT count(*) from t_heap;
drop table t_heap;
\q
dropdb test
pg_ctl restart

Below are the results. Again, almost no discernible difference between the master and patch. Also, the results when
RE: Parallel Seq Scan vs kernel read ahead
On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote: > On Tue, Jul 21, 2020 at 8:06 AM k.jami...@fujitsu.com > wrote: > > > > Thank you for the advice. I repeated the test as per your advice and > > average of 3 runs per worker/s planned. > > It still shows the following similar performance results between Master and > Patch V2. > > I wonder why there's no difference though. > > > > The test on my machine is roughly like this: > > > > createdb test > > psql -d test > > create table t_heap as select generate_series(1, 1) i; \q > > > > pg_ctl restart > > psql -d test > > SET track_io_timing = on; > > SET max_parallel_workers_per_gather = 0; SHOW > > max_parallel_workers_per_gather; EXPLAIN (ANALYZE, BUFFERS) SELECT > > count(*) from t_heap; \timing SELECT count(*) from t_heap; > > > > drop table t_heap; > > \q > > dropdb test > > pg_ctl restart > > > > Below are the results. Again, almost no discernible difference between the > master and patch. > > Also, the results when max_parallel_workers_per_gather is more than 4 > > could be inaccurate due to my machine's limitation of only having v4 > > CPUs. Even so, query planner capped it at > > 6 workers. 
> > Query Planner I/O Timings (track_io_timing = on) in ms:
> >
> > | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
> > |--------|-------------------|------------------|--------------------|-------------------|
> > | 0      | 1,130.777         | 1,250.821        | 01,698.051         | 01,733.439        |
> > | 1      | 1,603.016         | 1,660.767        | 02,312.248         | 02,291.661        |
> > | 2      | 2,036.269         | 2,107.066        | 02,698.216         | 02,796.893        |
> > | 3      | 2,298.811         | 2,307.254        | 05,695.991         | 05,894.183        |
> > | 4      | 2,098.642         | 2,135.960        | 23,837.088         | 26,537.158        |
> > | 5      | 1,956.536         | 1,997.464        | 45,891.851         | 48,049.338        |
> > | 6      | 2,201.816         | 2,219.001        | 61,937.828         | 67,809.486        |
> >
> > Query Planner Execution Time (ms):
> >
> > | Worker | QueryPlanner (Master) | QueryPlanner (Patch) |
> > |--------|-----------------------|----------------------|
> > | 0      | 40,454.252            | 40,521.578           |
> > | 1      | 21,332.067            | 21,205.068           |
> > | 2      | 14,266.756            | 14,385.539           |
> > | 3      | 11,597.936            | 11,722.055           |
> > | 4      | 12,937.468            | 13,439.247           |
> > | 5      | 14,383.083            | 14,782.866           |
> > | 6      | 14,671.336            | 15,507.581           |
> >
> > Based on the results above, the I/O latency increases as the number of workers increases. Despite that, the query planner execution time is almost the same once 2 or more workers are used (14~11s). Same results between Master and Patch V2.
> >
> > As for buffers, the same results are shown per worker (both Master and Patch).
> > | Worker | Buffers |
> > |--------|---------|
> > | 0      | shared read=442478 dirtied=442478 written=442446 |
> > | 1      | shared read=442478 dirtied=442478 written=442414 |
> > | 2      | shared read=442478 dirtied=442478 written=442382 |
> > | 3      | shared read=442478 dirtied=442478 written=442350 |
> > | 4      | shared read=442478 dirtied=442478 written=442318 |
> > | 5      | shared read=442478 dirtied=442478 written=442286 |
> > | 6      | shared read=442478 dirtied=442478 written=442254 |
> >
> > SQL Query Execution Time (ms):
> >
> > | Worker | SQL (Master) | SQL (Patch) |
> > |--------|--------------|-------------|
> > | 0      | 10,418.606   | 10,377.377  |
> > | 1      | 05,427.460   | 05,402.727  |
> > | 2      | 03,662.998   | 03,650.277  |
> > | 3      | 02,718.837   | 02,692.871  |
> > | 4      | 02,759.802   | 02,693.370  |
> > | 5      | 02,761.834   | 02,682.590  |
> > | 6      | 02,711.434   | 02,726.332  |
> >
> > The SQL query execution time definitely benefited from the previous query-planner run, so the results are faster. But again, both Master and Patched have almost the same results. Nonetheless, the execution time is almost consistent when max_parallel_workers_per_gather is 2 (default) and above.
> >
> > I am definitely missing something. Perhaps I could not understand why there's no I/O difference between the Master and Patched (V2). Or has it been already improved even without this patch?
>
> I don't think it is strange that you are not seeing much difference because, as per the initial email by Thomas, this patch is not supposed to give benefits on all systems. I think we wanted to check that the patch should not regress performance in cases where it doesn't give benefits. I think it might be okay to run
Re: pg_subscription.subslotname is wrongly marked NOT NULL
I wrote: > * On the other side of the ledger, if we don't fix these markings > we cannot back-patch the additional assertions I proposed at [1]. > I'm kind of leaning to committing this as shown and back-patching > the patch at [1], but certainly a case could be made in the other > direction. Thoughts? After further thought about that I realized that the assertion patch could be kluged in the same way as we did in llvmjit_deform.c, and that that would really be the only safe way to do it pre-v13. Otherwise the assertions would trip in pre-existing databases, which would not be nice. So what I've done is to back-patch the assertions that way, and *not* apply BKI_FORCE_NULL in the back branches. The possible downsides of doing that seem to outweigh the upside of making the catalog state cleaner in new installations. regards, tom lane
Re: Improve handling of pg_stat_statements handling of bind "IN" variables
> On Thu, Oct 3, 2019 at 3:33 AM Pavel Trukhanov > wrote: > >> On Wed, Jun 26, 2019 at 11:10 PM Tom Lane wrote: >> >> Greg Stark writes: >> > Actually thinking about this for two more seconds the question is what it >> > would do with a query like >> > WHERE col = ANY '1,2,3'::integer[] >> > Or >> > WHERE col = ANY ARRAY[1,2,3] >> > Whichever the language binding that is failing to do parameterized queries >> > is doing to emulate them. >> >> Yeah, one interesting question is whether this is actually modeling >> what happens with real-world applications --- are they sending Consts, >> or Params? >> >> I resolutely dislike the idea of marking arrays that came from IN >> differently from other ones; that's just a hack and it's going to give >> rise to unexplainable behavioral differences for logically-equivalent >> queries. > > Thanks for your input. > > As for real-world applications – being a founder of a server monitoring saas > (okmeter) I have access to stats on hundreds of postgres installations. > > It shows that IN with a variable number of params is ~7 times more used than > ANY(array). Hi, I would like to do some archaeology and inquire about this thread, since unfortunately no patch was presented as far as I can see. IIUC the ideas suggested in this thread revolve mostly around modifying the parser: > On Fri, Jun 14, 2019 at 2:46 AM Tom Lane wrote: > > I do not think you need new expression infrastructure. IMO what's going on > here is that we're indulging in premature optimization in the parser. It > would be better from a structural standpoint if the output of parse analysis > were closer to what the user wrote, and then the business of separating Vars > from Consts and reducing the Consts to an array were handled in the planner's > expression preprocessing phase. > > So maybe what you should be thinking about is a preliminary patch that's > mostly in the nature of refactoring, to move that processing to where it > should be.
> > Of course, life is never quite that simple; there are at least two > issues you'd have to think about. > > * The parsing phase is responsible for determining the semantics of > the query, in particular resolving the data types of the IN-list items > and choosing the comparison operators that will be used. The planner > is not allowed to rethink that. What I'm not clear about offhand is > whether the existing coding in parse analysis might lead to different > choices of data types/operators than a more straightforward approach > does. If so, we'd have to decide whether that's okay. > > * Postponing this work might make things slower overall, which wouldn't > matter too much for short IN-lists, but you can bet that people who > throw ten-thousand-entry IN-lists at us will notice. So you'd need to > keep an eye on efficiency and make sure you don't end up repeating > similar processing over and over. This puzzles me, since the original issue sounds like a "representation" problem, when we want to calculate jumble hash in a way that obviously repeating parameters or constants are hashed into one value. I see the point in ideas like this: >> One idea that comes to me after looking at the code involved is that >> it might be an improvement across-the-board if transformAExprIn were to >> reduce the generated ArrayExpr to an array Const immediately, in cases >> where all the inputs are Consts. That is going to happen anyway come >> plan time, so it'd have zero impact on semantics or query performance. >> Doing it earlier would cost nothing, and could even be a net win, by >> reducing per-parse-node overhead in places like the rewriter. >> >> The advantage for the problem at hand is that a Const that's an array >> of 2 elements is going to look the same as a Const that's any other >> number of elements so far as pg_stat_statements is concerned. >> >> This doesn't help of course in cases where the values aren't all >> Consts. 
Since we eliminated Vars already, the main practical case >> would be that they're Params, leading us back to the previous >> question of whether apps are binding queries with different numbers >> of parameter markers in an IN, and how hard pg_stat_statements should >> try to fuzz that if they are. >> >> Then, to Greg's point, there's a question of whether transformArrayExpr >> should do likewise, ie try to produce an array Const immediately. >> I'm a bit less excited about that, but consistency suggests that >> we should have it act the same as the IN case. Interestingly enough, something similar was already mentioned in [1]. But no one jumped into this, probably due to its relative complexity, lack of personal time, or no clear way to handle Params (I'm actually not sure about the statistics for Consts vs Params myself and need to check this, but can easily imagine both could be a frequent problem). Another idea also was mentioned in [1]: > I wonder whether we could improve this by arranging things so that both > Consts and
Re: OpenSSL randomness seeding
On 7/21/20 8:13 AM, Daniel Gustafsson wrote: After forking we call RAND_cleanup in fork_process.c to force a re-seed to ensure that two backends cannot share a sequence. OpenSSL 1.1.0 deprecated RAND_cleanup, and contrary to how they usually leave deprecated APIs working until removed, they decided to silently make this call a noop like below: # define RAND_cleanup() while(0) continue This leaves our defence against pool sharing seemingly useless, and also against the recommendations of OpenSSL for versions > 1.1.0 and < 1.1.1 where the RNG was rewritten: https://wiki.openssl.org/index.php/Random_fork-safety The silver lining here is that while OpenSSL nooped RAND_cleanup, they also changed what is mixed into seeding so we are still not sharing a sequence. To fix this, changing the RAND_cleanup call to RAND_poll should be enough to ensure re-seeding after forking across all supported OpenSSL versions. Patch 0001 implements this along with a comment referencing when it can be removed (which most likely won't be for quite some time). This looks reasonable to me based on your explanation and the OpenSSL wiki. Another thing that stood out when reviewing this code is that we optimize for RAND_poll failing in pg_strong_random, when we already have RAND_status checking for a sufficiently seeded RNG for us. ISTM that we can simplify the code by letting RAND_status do the work as per 0002, and also (while unlikely) survive any transient failures in RAND_poll by allowing all the retries we've defined for the loop. I wonder how effective the retries are going to be if they happen immediately. However, most of the code paths I followed ended in a hard error when pg_strong_random() failed so it may not hurt to try. I just worry that some caller is depending on a faster failure here. Regards, -- -David da...@pgmasters.net
Re: Improving psql slash usage help message
Hamid Akhtar writes: > So are you suggesting to not fix this or do a more detailed review and > assess what other psql messages can be grouped together. I was just imagining merging the entries for the commands that are implemented by listTables(). If you see something else that would be worth doing, feel free to suggest it, but that wasn't what I was thinking of. regards, tom lane
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
On Tue, Jul 21, 2020 at 4:08 PM Dilip Kumar wrote: > > On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar wrote: > > > > On Tue, Jul 21, 2020 at 2:00 AM Andres Freund wrote: > > > > > > Hi, > > > > > > On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote: > > > > The attached patch allows the vacuum to continue by emitting WARNING > > > > for the corrupted tuple instead of immediately erroring out as discussed > > > > at [1]. > > > > > > > > Basically, it provides a new GUC called vacuum_tolerate_damage, to > > > > control whether to continue the vacuum or to stop on the occurrence of > > > > a corrupted tuple. So if vacuum_tolerate_damage is set then in > > > > all the cases in heap_prepare_freeze_tuple where a corrupted xid is > > > > detected, it will emit a warning and return that nothing is changed in > > > > the tuple, and 'tuple_totally_frozen' will also be set to false. > > > > Since we are returning false the caller will not try to freeze such > > > > a tuple, and tuple_totally_frozen is also set to false so that the > > > > page will not be marked all-frozen even if all other tuples in the > > > > page are frozen. > > > > > > I'm extremely doubtful this is a good idea. In all likelihood this will > > > just exacerbate corruption. > > > > > > You cannot just stop freezing tuples, that'll lead to relfrozenxid > > > getting *further* out of sync with the actual table contents. And you > > > cannot just freeze such tuples, because that has a good chance of making > > > deleted tuples suddenly visible, leading to unique constraint violations > > > etc. Which will then subsequently lead to clog lookup errors and such. > > > > I agree with the point. But if we keep giving the ERROR in such > > cases, the situation is not any better. Basically, we are > > not freezing such tuples, and we cannot advance the > > relfrozenxid. So we follow the same rule: we don't freeze > > those tuples and also don't advance the relfrozenxid.
The only > > difference is that we allow the vacuum to continue with other tuples. > > > > > At the very least you'd need to signal up that relfrozenxid/relminmxid > > > cannot be increased. Without that I think it's entirely unacceptable to > > > do this. > > > > I agree with that point. I was just unsure whether we should disallow > > advancing the relfrozenxid in all such cases, or whether in cases where > > the xid already precedes the relfrozenxid we can allow it to advance, > > as it cannot become any worse. But, as Alvaro pointed out, there > > is no point in optimizing such cases. I will update the patch to > > stop advancing the relfrozenxid if we find any corrupted xid during > > tuple freeze. > > > > > If we really were to do something like this the option would need to be > > > called vacuum_allow_making_corruption_worse or such. It needs to be > > > *exceedingly* clear that it will likely lead to making everything much > > > worse. > > > > > Maybe we can clearly describe this in the documentation. > > > > > > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, > > > > frz->t_infomask = tuple->t_infomask; > > > > frz->xmax = HeapTupleHeaderGetRawXmax(tuple); > > > > > > I don't think it can be right to just update heap_prepare_freeze_tuple() > > > but not FreezeMultiXactId(). > > > > Oh, I missed this part. I will fix it. > > Please find the updated patch. In this version, we don't allow the > relfrozenxid and relminmxid to advance if corruption is detected, > and I have also added the handling in FreezeMultiXactId. In the previous version, the feature was enabled for the cluster/vacuum full commands as well. In the attached patch I have enabled it only if we are running the vacuum command. It will not be enabled during a table rewrite. If we think that it should be enabled for 'vacuum full' then we might need to pass a flag from cluster_rel all the way down to heap_freeze_tuple.
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v3-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch Description: Binary data
Re: Postgres-native method to identify if a tuple is frozen
On Mon, Jul 20, 2020 at 9:07 PM Lawrence Jones wrote: > > > So we hit the question: how can we identify if a tuple is frozen? I know the > tuple has both committed and aborted hint bits set, but accessing those bits > seems to require superuser functions and is unlikely to be that fast. > > Are there system columns (similar to xmin, tid, cid) that we don't know about? > I think the way to get that information is to use the pageinspect extension with a query like the one below, but you are right that you need superuser privileges for that: SELECT t_ctid, raw_flags, combined_flags FROM heap_page_items(get_raw_page('pg_class', 0)), LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL; -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
OpenSSL randomness seeding
After forking we call RAND_cleanup in fork_process.c to force a re-seed to ensure that two backends cannot share a sequence. OpenSSL 1.1.0 deprecated RAND_cleanup, and contrary to how they usually leave deprecated APIs working until removed, they decided to silently make this call a noop like below: # define RAND_cleanup() while(0) continue This leaves our defence against pool sharing seemingly useless, and also goes against the recommendations of OpenSSL for versions > 1.1.0 and < 1.1.1 where the RNG was rewritten: https://wiki.openssl.org/index.php/Random_fork-safety The silver lining here is that while OpenSSL nooped RAND_cleanup, they also changed what is mixed into seeding so we are still not sharing a sequence. To fix this, changing the RAND_cleanup call to RAND_poll should be enough to ensure re-seeding after forking across all supported OpenSSL versions. Patch 0001 implements this along with a comment referencing when it can be removed (which most likely won't be for quite some time). Another thing that stood out when reviewing this code is that we optimize for RAND_poll failing in pg_strong_random, when we already have RAND_status checking for a sufficiently seeded RNG for us. ISTM that we can simplify the code by letting RAND_status do the work as per 0002, and also (while unlikely) survive any transient failures in RAND_poll by allowing all the retries we've defined for the loop. Also, as a disclaimer, this was brought up with the PostgreSQL security team first, who have given permission to discuss this in public. Thoughts on these? cheers ./daniel 0002-Remove-optimization-for-RAND_poll-failing.patch Description: Binary data 0001-Use-RAND_poll-for-seeding-randomness-after-fork.patch Description: Binary data -- VMware
Comment referencing incorrect algorithm
While poking around our crypto code, I noticed that a comment in sha2.h was referencing SHA-1, an algorithm not supported by that code. The attached patch fixes the comment, aligning it with the other comments in the file. cheers ./daniel sha1_comment.diff Description: Binary data
Re: Improving psql slash usage help message
So are you suggesting not to fix this, or to do a more detailed review and assess what other psql messages can be grouped together? On Sun, Jul 12, 2020 at 8:15 PM Tom Lane wrote: > Hamid Akhtar writes: > > On Sun, Apr 26, 2020 at 1:03 AM David G. Johnston < > > david.g.johns...@gmail.com> wrote: > >> On Sat, Apr 25, 2020 at 12:29 PM Hamid Akhtar > >> wrote: > >>> "\dE" displays the list with a "List of relations" heading whereas > "\det" > >>> displays "List of foreign tables". So, to differentiate the two, I > suggest > >>> to change the help message for "\dE" to: > >>> \dE[S+] [PATTERN] list foreign relations > > >> help.c and the documentation need to be synchronized a bit more than > this > >> single issue. > >> Calling it "foreign relation" for \dE and "foreign table" for \det does > >> convey that there is a difference - not sure it a huge improvement > though. > >> The "\d[Eimstv]" family of meta-commands should, in the help, probably > be > >> moved together to show the fact that they are basically "list relation > >> names [of this type only]" while "\det" is "list foreign table info". > > FWIW, I agree with David on this point. I find it bizarre and unhelpful > that slashUsage shows \dt, \di, etc as separate commands and fails to > indicate that they can be combined. We could merge these entries into > > fprintf(output, _(" \\d[tivmsE][S+] [PATRN] list relations of > specified type(s)\n")); > > which would both remind people that they can give more than one type, > and shorten the already-overly-long list. > > > I think from a user perspective, grouping these wouldn't be helpful. For > > example, it may cause FDW related commands to be spread through out the > > help. > > That seems quite irrelevant to this proposal. > > regards, tom lane > -- Highgo Software (Canada/China/Pakistan) URL : www.highgo.ca ADDR: 10318 WHALLEY BLVD, Surrey, BC CELL:+923335449950 EMAIL: mailto:hamid.akh...@highgo.ca SKYPE: engineeredvirus
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar wrote: > > On Tue, Jul 21, 2020 at 2:00 AM Andres Freund wrote: > > > > Hi, > > > > On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote: > > > The attached patch allows the vacuum to continue by emitting WARNING > > > for the corrupted tuple instead of immediately erroring out as discussed > > > at [1]. > > > > > > Basically, it provides a new GUC called vacuum_tolerate_damage, to > > > control whether to continue the vacuum or to stop on the occurrence of > > > a corrupted tuple. So if the vacuum_tolerate_damage is set then in > > > all the cases in heap_prepare_freeze_tuple where the corrupted xid is > > > detected, it will emit a warning and return that nothing is changed in > > > the tuple and the 'tuple_totally_frozen' will also be set to false. > > > Since we are returning false the caller will not try to freeze such > > > a tuple and the tuple_totally_frozen is also set to false so that the > > > page will not be marked all frozen even if all other tuples in the > > > page are frozen. > > > > I'm extremely doubtful this is a good idea. In all likelihood this will > > just exacerbate corruption. > > > > You cannot just stop freezing tuples, that'll lead to relfrozenxid > > getting *further* out of sync with the actual table contents. And you > > cannot just freeze such tuples, because that has a good chance of making > > deleted tuples suddenly visible, leading to unique constraint violations > > etc. Which will then subsequently lead to clog lookup errors and such. > > I agree with the point. But, if we keep giving the ERROR in such > cases then the situation is not any better either. Basically, we are > not freezing such tuples and we also cannot advance the > relfrozenxid. So we can follow the same rule: we don't freeze > those tuples and also don't advance the relfrozenxid. The only > difference is, we allow the vacuum to continue with other tuples.
> > > At the very least you'd need to signal up that relfrozenxid/relminmxid > > cannot be increased. Without that I think it's entirely unacceptable to > > do this. > > I agree with that point. I was just confused that shall we disallow > to advance the relfrozenxid in all such cases or in some cases where > the xid already precedes the relfrozenxid, we can allow it to advance > as it can not become any worse. But, as Alvaro pointed out that there > is no point in optimizing such cases. I will update the patch to > stop advancing the relfrozenxid if we find any corrupted xid during > tuple freeze. > > > If we really were to do something like this the option would need to be > > called vacuum_allow_making_corruption_worse or such. Its need to be > > *exceedingly* clear that it will likely lead to making everything much > > worse. > > > Maybe we can clearly describe this in the document. > > > > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, > > > frz->t_infomask = tuple->t_infomask; > > > frz->xmax = HeapTupleHeaderGetRawXmax(tuple); > > > > I don't think it can be right to just update heap_prepare_freeze_tuple() > > but not FreezeMultiXactId(). > > oh, I missed this part. I will fix it. Please find the updated patch. In this version, we don't allow the relfrozenxid and relminmxid to advance if the corruption is detected and also added the handling in FreezeMultiXactId. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v2-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch Description: Binary data
Re: Parallel Seq Scan vs kernel read ahead
On Tue, Jul 21, 2020 at 3:08 PM k.jami...@fujitsu.com wrote: > > On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote: > > On Tue, Jul 21, 2020 at 8:06 AM k.jami...@fujitsu.com > > > > wrote: > > > > > > I am definitely missing something. Perhaps I could not > > > understand why there's no I/O difference between the Master and > > > Patched (V2). Or has it been already improved even without this patch? > > > > > > > I don't think it is strange that you are not seeing much difference because > > as per > > the initial email by Thomas this patch is not supposed to give benefits on > > all > > systems. I think we wanted to check that the patch should not regress > > performance in cases where it doesn't give benefits. I think it might be > > okay to > > run with a higher number of workers than you have CPUs in the system as we > > wanted to check if such cases regress as shown by Soumyadeep above [1]. Can > > you once try with > > 8 and or 10 workers as well? > > > > You are right. Kindly excuse me on that part, which only means it may or may > not have any > benefits on the filesystem I am using. But for other fs, as we can see from > David's benchmarks, there are > significant results/benefits. > > Following your advice on regression test case, I increased the number of > workers, > but the query planner still capped the workers at 6, so the results from 6 > workers onwards > are almost the same. > I am slightly confused: if the number of workers is capped at 6, then what exactly does the data at the 32-worker count mean? If you want the query planner to choose a higher number of workers, then I think you either need to increase the data or use Alter Table Set (parallel_workers = ); > I don't see significant difference between master and patched on my machine > as per my test results below. (Just for reconfirmation) > I see a difference of about 7-8% at the higher (32) client count. Am I missing something? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Fri, Jul 17, 2020 at 2:09 PM vignesh C wrote: > > > > > Please find the updated patch with the fixes included. > > > > Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch > had few indentation issues, I have fixed and attached the patch for > the same. > Ensure to use the version with each patch-series as that makes it easier for the reviewer to verify the changes done in the latest version of the patch. One way is to use commands like "git format-patch -6 -v " or you can add the version number manually. Review comments: === 0001-Copy-code-readjustment-to-support-parallel-copy 1. @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate) else nbytes = 0; /* no data need be saved */ + if (cstate->copy_dest == COPY_NEW_FE) + minread = RAW_BUF_SIZE - nbytes; + inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes, - 1, RAW_BUF_SIZE - nbytes); + minread, RAW_BUF_SIZE - nbytes); No comment to explain why this change is done? 0002-Framework-for-leader-worker-in-parallel-copy 2. + * ParallelCopyLineBoundary is common data structure between leader & worker, + * Leader process will be populating data block, data block offset & the size of + * the record in DSM for the workers to copy the data into the relation. + * This is protected by the following sequence in the leader & worker. If they + * don't follow this order the worker might process wrong line_size and leader + * might populate the information which worker has not yet processed or in the + * process of processing. + * Leader should operate in the following order: + * 1) check if line_size is -1, if not wait, it means worker is still + * processing. + * 2) set line_state to LINE_LEADER_POPULATING. + * 3) update first_block, start_offset & cur_lineno in any order. + * 4) update line_size. + * 5) update line_state to LINE_LEADER_POPULATED. + * Worker should operate in the following order: + * 1) check line_state is LINE_LEADER_POPULATED, if not it means leader is still + * populating the data. 
+ * 2) read line_size. + * 3) only one worker should choose one line for processing, this is handled by + *using pg_atomic_compare_exchange_u32, worker will change the sate to + *LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED. + * 4) read first_block, start_offset & cur_lineno in any order. + * 5) process line_size data. + * 6) update line_size to -1. + */ +typedef struct ParallelCopyLineBoundary Are we doing all this state management to avoid using locks while processing lines? If so, I think we can use either spinlock or LWLock to keep the main patch simple and then provide a later patch to make it lock-less. This will allow us to first focus on the main design of the patch rather than trying to make this datastructure processing lock-less in the best possible way. 3. + /* + * Actual lines inserted by worker (some records will be filtered based on + * where condition). + */ + pg_atomic_uint64 processed; + pg_atomic_uint64 total_worker_processed; /* total processed records by the workers */ The difference between processed and total_worker_processed is not clear. Can we expand the comments a bit? 4. + * SerializeList - Insert a list into shared memory. + */ +static void +SerializeList(ParallelContext *pcxt, int key, List *inputlist, + Size est_list_size) +{ + if (inputlist != NIL) + { + ParallelCopyKeyListInfo *sharedlistinfo = (ParallelCopyKeyListInfo *)shm_toc_allocate(pcxt->toc, + est_list_size); + CopyListSharedMemory(inputlist, est_list_size, sharedlistinfo); + shm_toc_insert(pcxt->toc, key, sharedlistinfo); + } +} Why do we need to write a special mechanism (CopyListSharedMemory) to serialize a list. Why can't we use nodeToString? It should be able to take care of List datatype, see outNode which is called from nodeToString. Once you do that, I think you won't need even EstimateLineKeysList, strlen should work instead. Check, if you have any similar special handling for other types that can be dealt with nodeToString? 5. 
+ MemSet(shared_info_ptr, 0, est_shared_info); + shared_info_ptr->is_read_in_progress = true; + shared_info_ptr->cur_block_pos = -1; + shared_info_ptr->full_transaction_id = full_transaction_id; + shared_info_ptr->mycid = GetCurrentCommandId(true); + for (count = 0; count < RINGSIZE; count++) + { + ParallelCopyLineBoundary *lineInfo = &shared_info_ptr->line_boundaries.ring[count]; + pg_atomic_init_u32(&(lineInfo->line_size), -1); + } + You can move this initialization into a separate function. 6. In function BeginParallelCopy(), you need to keep a provision to collect wal_usage and buf_usage stats. See _bt_begin_parallel for reference. Those will be required for pg_stat_statements. 7. DeserializeString() -- it is better to name this function as RestoreString. ParallelWorkerInitialization() -- it is better to name this function as InitializeParallelCopyInfo or something like that, the current name is quite confusing. ParallelCopyLeader() -- how about
Re: Auto-vectorization speeds up multiplication of large-precision numerics
On Mon, 13 Jul 2020 at 14:27, Amit Khandekar wrote: > I tried this in utils/adt/Makefile : > + > +numeric.o: CFLAGS += ${CFLAGS_VECTOR} > + > and it works. > > CFLAGS_VECTOR also includes the -funroll-loops option, which I > believe had shown improvements in the checksum.c runs ( [1] ). This > option makes the object file a bit bigger. For numeric.o, its size > increased by 15K; from 116672 to 131360 bytes. I ran the > multiplication test, and didn't see any additional speed-up with this > option. Also, it does not seem to be related to vectorization. So I > was thinking of splitting CFLAGS_VECTOR into CFLAGS_VECTOR and > CFLAGS_UNROLL_LOOPS. Checksum.c can use both these flags, and > numeric.c can use only CFLAGS_VECTOR. I did as above. Attached is the v2 patch. As with the existing CFLAGS_VECTOR, an env variable of that name can also be set when running configure. I did the same for CFLAGS_UNROLL_LOOPS. Now, developers who are already using the CFLAGS_VECTOR env while configuring might be using this env because their compilers don't have these compiler options, so they must be using some equivalent compiler options. numeric.c will now be compiled with CFLAGS_VECTOR, so for them it will now be compiled with their equivalent of the vectorize and unroll-loops options, which is ok, I think. Just that the numeric.o size will be increased, that's it. > > [1] > https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com#2ec419817ff429588dd1229fb663080e -- Thanks, -Amit Khandekar Huawei Technologies From 029373ab2e9d2e6802326871f248ba2f21ffb204 Mon Sep 17 00:00:00 2001 From: Amit Khandekar Date: Tue, 21 Jul 2020 16:42:51 +0800 Subject: [PATCH] Auto-vectorize loop to speedup large-precision numeric product A 'for' loop in mul_var() runs backwards by decrementing two variables. This prevents the gcc compiler from auto-vectorizing the for loop. So make it a forward loop with a single variable.
This gives performance benefits for product of numeric types with large precision, with speedups becoming noticeable from values with precisions starting from 20-40. Typical pattern of benefit is : precision 50: 5%; precision 60: 11%; 120 : 50%; 240: 2.2x; and so on. On some CPU architectures, the speedup starts from 20 precision onwards. With the precisions used in the numeric_big regression test, the multiplication speeds up by 2.5 to 2.7 times. Auto-vectorization happens with gcc -O3 flag or -ftree-loop-vectorize. So arrange for the -ftree-loop-vectorize flag specifically when compiling numeric.c. CFLAGS_VECTOR was already present for similar functionality in checksum.c, but CFLAGS_VECTOR also includes -funroll-loops which unnecessarily makes numeric.o larger. So split CFLAGS_VECTOR into CFLAGS_UNROLL_LOOPS and CFLAGS_VECTOR. --- configure | 17 +++-- configure.in | 9 +++-- src/Makefile.global.in| 1 + src/backend/storage/page/Makefile | 2 +- src/backend/utils/adt/Makefile| 3 +++ src/backend/utils/adt/numeric.c | 11 --- 6 files changed, 31 insertions(+), 12 deletions(-) diff --git a/configure b/configure index cb8fbe1051..36ca734ad5 100755 --- a/configure +++ b/configure @@ -734,6 +734,7 @@ CPP CFLAGS_SL BITCODE_CXXFLAGS BITCODE_CFLAGS +CFLAGS_UNROLL_LOOPS CFLAGS_VECTOR PERMIT_DECLARATION_AFTER_STATEMENT LLVM_BINPATH @@ -5266,10 +5267,13 @@ BITCODE_CFLAGS="" user_BITCODE_CXXFLAGS=$BITCODE_CXXFLAGS BITCODE_CXXFLAGS="" -# set CFLAGS_VECTOR from the environment, if available +# set CFLAGS_VECTOR and CFLAGS_UNROLL_LOOPS from the environment, if available if test "$ac_env_CFLAGS_VECTOR_set" = set; then CFLAGS_VECTOR=$ac_env_CFLAGS_VECTOR_value fi +if test "$ac_env_CFLAGS_UNROLL_LOOPS_set" = set; then + CFLAGS_UNROLL_LOOPS=$ac_env_CFLAGS_UNROLL_LOOPS_value +fi # Some versions of GCC support some additional useful warning flags. # Check whether they are supported, and add them to CFLAGS if so.
@@ -6102,16 +6106,16 @@ if test x"$pgac_cv_prog_CXX_cxxflags__fexcess_precision_standard" = x"yes"; then fi - # Optimization flags for specific files that benefit from vectorization - { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR" >&5 -$as_echo_n "checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR... " >&6; } + # Optimization flags for specific files that benefit from loop unrolling + { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -funroll-loops, for CFLAGS_UNROLL_LOOPS" >&5 +$as_echo_n "checking whether ${CC} supports -funroll-loops, for CFLAGS_UNROLL_LOOPS... " >&6; } if ${pgac_cv_prog_CC_cflags__funroll_loops+:} false; then : $as_echo_n "(cached) " >&6 else pgac_save_CFLAGS=$CFLAGS pgac_save_CC=$CC CC=${CC} -CFLAGS="${CFLAGS_VECTOR} -funroll-loops" +CFLAGS="${CFLAGS_UNROLL_LOOPS} -funroll-loops" ac_save_c_werror_flag=$ac_c_werror_flag
RE: Performing partition pruning using row value
>So, after looking at these functions and modifying this patch, I would like to add this patch to the next CF. I updated this patch and registered it for the next CF. https://commitfest.postgresql.org/29/2654/ regards, sho kato pruning-with-row-wise-comparison-v2.patch Description: pruning-with-row-wise-comparison-v2.patch
Re: Wrong results from in_range() tests with infinite offset
On Tue, 21 Jul 2020 at 03:06, Tom Lane wrote: > > Pushed, but I chickened out of back-patching. The improvement in what > happens for finite comparison values seems somewhat counterbalanced by > the possibility that someone might not like the definition we arrived > at for infinities. So, it's not quite an open-and-shut bug fix, so > I just put it in HEAD (for now anyway). > Yeah, that seems fair enough, and it's quite an obscure corner-case that has gone unnoticed for quite some time. Regards, Dean
WAL segment switch on pg_start_backup()
Hi, I am currently exploring the pg_start_backup() and pg_stop_backup() functions. In the documentation (https://www.postgresql.org/docs/9.0/functions-admin.html), it is stated that after calling pg_stop_backup() Postgres switches to a new WAL segment file. But it doesn’t say the same for pg_start_backup(). However, I found the following comment regarding pg_start_backup() in the source code: Excerpt from Postgres source code https://doxygen.postgresql.org/xlog_8c_source.html#l10595 * Force an XLOG file switch before the checkpoint, to ensure that the * WAL segment the checkpoint is written to doesn't contain pages with * old timeline IDs. That would otherwise happen if you called * pg_start_backup() right after restoring from a PITR archive: the * first WAL segment containing the startup checkpoint has pages in * the beginning with the old timeline ID. That can cause trouble at * recovery: we won't have a history file covering the old timeline if * pg_wal directory was not included in the base backup and the WAL * archive was cleared too before starting the backup. So does it mean that Postgres always switches to a new WAL segment file on a pg_start_backup() call too? If so, as I understand it, the newly created WAL segment file should start from the checkpoint and should not contain any WAL records regarding the events that happened before the pg_start_backup() call? Thanks, Daniil Zakhlystov
Re: new heapcheck contrib module
On Tue, Jul 21, 2020 at 10:58 AM Amul Sul wrote: > > Hi Mark, > > I think new structures should be listed in src/tools/pgindent/typedefs.list, > otherwise, pgindent might disturb its indentation. > > Regards, > Amul > > > On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger > wrote: > > > > > > > > > On Jul 16, 2020, at 12:38 PM, Robert Haas wrote: > > > > > > On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger > > > wrote: > > >> The v10 patch without these ideas is here: > > > > > > Along the lines of what Alvaro was saying before, I think this > > > definitely needs to be split up into a series of patches. The commit > > > message for v10 describes it doing three pretty separate things, and I > > > think that argues for splitting it into a series of three patches. I'd > > > argue for this ordering: > > > > > > 0001 Refactoring existing amcheck btree checking functions to optionally > > > return corruption information rather than ereport'ing it. This is > > > used by the new pg_amcheck command line tool for reporting back to > > > the caller. > > > > > > 0002 Adding new function verify_heapam for checking a heap relation and > > > associated toast relation, if any, to contrib/amcheck. > > > > > > 0003 Adding new contrib module pg_amcheck, which is a command line > > > interface for running amcheck's verifications against tables and > > > indexes. > > > > > > It's too hard to review things like this when it's all mixed together. > > > > The v11 patch series is broken up as you suggest. > > > > > +++ b/contrib/amcheck/t/skipping.pl > > > > > > The name of this file is inconsistent with the tree's usual > > > convention, which is all stuff like 001_whatever.pl, except for > > > src/test/modules/brin, which randomly decided to use two digits > > > instead of three. There's no precedent for a test file with no leading > > > numeric digits. Also, what does "skipping" even have to do with what > > > the test is checking? 
Maybe it's intended to refer to the new error > > > handling "skipping" the actual error in favor of just reporting it > > > without stopping, but that's not really what the word "skipping" > > > normally means. Finally, it seems a bit over-engineered: do we really > > > need 183 test cases to check that detecting a problem doesn't lead to > > > an abort? Like, if that's the purpose of the test, I'd expect it to > > > check one corrupt relation and one non-corrupt relation, each with and > > > without the no-error behavior. And that's about it. Or maybe it's > > > talking about skipping pages during the checks, because those pages > > > are all-visible or all-frozen? It's not very clear to me what's going > > > on here. > > > > The "skipping" did originally refer to testing verify_heapam()'s option to > > skip all-visible or all-frozen blocks. I have renamed it > > 001_verify_heapam.pl, since it tests that function. > > > > > > > > + TransactionId nextKnownValidXid; > > > + TransactionId oldestValidXid; > > > > > > Please add explanatory comments indicating what these are intended to > > > mean. > > > > Done. > > > > > For most of the structure members, the brief comments > > > already present seem sufficient; but here, more explanation looks > > > necessary and less is provided. The "Values for returning tuples" > > > could possibly also use some more detail. > > > > Ok, I've expanded the comments for these. > > > > > +#define HEAPCHECK_RELATION_COLS 8 > > > > > > I think this should really be at the top of the file someplace. > > > Sometimes people have adopted this style when the #define is only used > > > within the function that contains it, but that's not the case here. > > > > Done.
> > > > > > > > + ereport(ERROR, > > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > > + errmsg("unrecognized parameter for 'skip': %s", skip), > > > + errhint("please choose from 'all visible', 'all frozen', " > > > + "or NULL"))); > > > > > > I think it would be better if we had three string values selecting the > > > different behaviors, and made the parameter NOT NULL but with a > > > default. It seems like that would be easier to understand. Right now, > > > I can tell that my options for what to skip are "all visible", "all > > > frozen", and, uh, some other thing that I don't know what it is. I'm > > > gonna guess the third option is to skip nothing, but it seems best to > > > make that explicit. Also, should we maybe consider spelling this > > > 'all-visible' and 'all-frozen' with dashes, instead of using spaces? > > > Spaces in an option value seems a little icky to me somehow. > > > > I've made the options 'all-visible', 'all-frozen', and 'none'. It defaults > > to 'none'. I did not mark the function as strict, as I think NULL is a > > reasonable value (and the default) for startblock and endblock. > > > > > + int64 startblock = -1; > > > + int64 endblock = -1; > > > ... > > > + if (!PG_ARGISNULL(3)) > > > + startblock = PG_GETARG_INT64(3); > > > + if (!PG_ARGISNULL(4)) > > > + endblock =