Re: Berserk Autovacuum (let's save next Mandrill)
On Tue, Sep 10, 2019 at 8:19 PM Amit Kapila wrote:
>
> On Tue, Jul 23, 2019 at 1:53 PM Masahiko Sawada wrote:
> >
> > To invoke autovacuum even on insert-only tables we would need to check
> > the number of inserted tuples since the last vacuum. I think we can keep
> > track of the number of inserted tuples since the last vacuum in the stats
> > collector and add a threshold to invoke vacuum with INDEX_CLEANUP =
> > false. If an autovacuum worker confirms that the number of inserted
> > tuples exceeds the threshold, it invokes vacuum with INDEX_CLEANUP =
> > false. However, if the number of dead tuples also exceeds the
> > autovacuum thresholds (autovacuum_vacuum_threshold and
> > autovacuum_vacuum_scale_factor) it should invoke vacuum with
> > INDEX_CLEANUP = true. Therefore the new threshold makes sense only when
> > it's lower than the autovacuum thresholds.
> >
> > I guess we can have one new GUC parameter to control the scale factor.
> > Since only relatively large tables will require this feature we might
> > not need a threshold based on the number of tuples.
>
> Generally speaking, having more GUCs for autovacuum, and ones that are
> in some way dependent on existing GUCs, sounds a bit scary,
> but OTOH whatever you wrote makes sense and can help the scenarios
> which this thread is trying to deal with. Have you given any thought
> to what Alvaro mentioned up-thread: "certain tables would have some
> sort of partial scan that sets the visibility map. There's no reason
> to invoke the whole vacuuming machinery. I don't think this is
> limited to append-only tables, but rather those are just the ones
> that are affected the most."?

Speaking of partial scans, I've considered before that we could use WAL
to find which pages have much garbage, instead of scanning all
not-all-visible pages. We could then vacuum only the particular part of
the table where garbage collection is most effective, instead of the
whole table. I shared some results of that at PGCon; it's still in PoC
state.

Also, to address the issue of updating the VM of mostly-append-only
tables, I considered some possible solutions:

1. With INDEX_CLEANUP = false and TRUNCATE = false, vacuum does HOT
pruning, vacuums the table, and updates the VM. In addition to updating
the VM we need to do the other two operations, but since
mostly-insert-only tables would have less garbage, the HOT pruning and
table vacuuming should be a light workload. This is what I proposed
up-thread.

2. This may have already been discussed before, but we could update the
VM when HOT pruning during a SELECT operation. Since this affects SELECT
performance, it should be enabled only on particular tables by user
request.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
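For context, a sketch of how the proposal above might look from the user's
side. The reloption name below is hypothetical (no such parameter exists in
any committed patch); the VACUUM options themselves do exist as of v12:

    -- Hypothetical per-table scale factor for triggering vacuum based on
    -- inserted tuples; the parameter name is an assumption, shown only
    -- for illustration.
    ALTER TABLE append_only_log
        SET (autovacuum_vacuum_insert_scale_factor = 0.01);

    -- The kind of lightweight vacuum the proposal would invoke: skip index
    -- cleanup and relation truncation, but still do HOT pruning and set
    -- visibility map bits (these options exist in PostgreSQL 12).
    VACUUM (INDEX_CLEANUP FALSE, TRUNCATE FALSE) append_only_log;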
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Fri, Sep 13, 2019 at 09:59:40AM +0530, Amit Kapila wrote:
> I think that is what we have not done in one of the cases I pointed out.

Thinking more about it, I see your point now. HEAP_LOCKED_UPGRADED is not a
direct combination of the other flags and depends on other conditions, so we
cannot treat it as a combination of other things. The three others don't
have that problem. Attached is a patch following your suggestions. This
also removes the use of HEAP_XMAX_IS_LOCKED_ONLY, which did not make
complete sense either as a "raw" flag. While on it, the order of the flags
is changed to more closely match the order in htup_details.h.

Does this patch address your concerns?
--
Michael

diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 6a09d46a57..76f02dbea2 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -91,7 +91,7 @@ FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) m(flags);
  t_infomask | t_infomask2 |                           flags
 ------------+-------------+------------------------------------------------------------
-       2816 |           2 | {HEAP_XMAX_INVALID,HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID}
+       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID}
 (1 row)
 
 -- output the decoded flag HEAP_XMIN_FROZEN instead
@@ -100,7 +100,7 @@ FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2, true) m(flags);
  t_infomask | t_infomask2 |                 flags
 ------------+-------------+---------------------------------------
-       2816 |           2 | {HEAP_XMAX_INVALID,HEAP_XMIN_FROZEN}
+       2816 |           2 | {HEAP_XMIN_FROZEN,HEAP_XMAX_INVALID}
 (1 row)
 
 -- tests for decoding of combined flags
@@ -140,20 +140,7 @@ SELECT heap_tuple_infomask_flags(x'C000'::int, 0, true);
 SELECT heap_tuple_infomask_flags(x'C000'::int, 0, false);
    heap_tuple_infomask_flags
 --------------------------------
- {HEAP_MOVED_IN,HEAP_MOVED_OFF}
-(1 row)
-
--- HEAP_LOCKED_UPGRADED = (HEAP_XMAX_IS_MULTI | HEAP_XMAX_LOCK_ONLY)
-SELECT heap_tuple_infomask_flags(x'1080'::int, 0, true);
- heap_tuple_infomask_flags
-----------------------------
- {HEAP_LOCKED_UPGRADED}
-(1 row)
-
-SELECT heap_tuple_infomask_flags(x'1080'::int, 0, false);
-        heap_tuple_infomask_flags
-------------------------------------------
- {HEAP_XMAX_LOCK_ONLY,HEAP_XMAX_IS_MULTI}
+ {HEAP_MOVED_OFF,HEAP_MOVED_IN}
 (1 row)
 
 -- test all flags of t_infomask and t_infomask2
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 68f16cd400..c696d7d6d1 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -540,12 +540,8 @@ heap_tuple_infomask_flags(PG_FUNCTION_ARGS)
 		d[cnt++] = CStringGetTextDatum("HEAP_HASOID_OLD");
 	if ((t_infomask & HEAP_COMBOCID) != 0)
 		d[cnt++] = CStringGetTextDatum("HEAP_COMBOCID");
-	if ((t_infomask & HEAP_XMAX_COMMITTED) != 0)
-		d[cnt++] = CStringGetTextDatum("HEAP_XMAX_COMMITTED");
-	if ((t_infomask & HEAP_XMAX_INVALID) != 0)
-		d[cnt++] = CStringGetTextDatum("HEAP_XMAX_INVALID");
-	if ((t_infomask & HEAP_UPDATED) != 0)
-		d[cnt++] = CStringGetTextDatum("HEAP_UPDATED");
+	if ((t_infomask & HEAP_XMAX_LOCK_ONLY) != 0)
+		d[cnt++] = CStringGetTextDatum("HEAP_XMAX_LOCK_ONLY");
 
 	/* decode combined masks of t_infomaks */
 	if (decode_combined && (t_infomask & HEAP_XMAX_SHR_LOCK) != 0)
@@ -568,24 +564,23 @@ heap_tuple_infomask_flags(PG_FUNCTION_ARGS)
 			d[cnt++] = CStringGetTextDatum("HEAP_XMIN_INVALID");
 	}
 
+	if ((t_infomask & HEAP_XMAX_COMMITTED) != 0)
+		d[cnt++] = CStringGetTextDatum("HEAP_XMAX_COMMITTED");
+	if ((t_infomask & HEAP_XMAX_INVALID) != 0)
+		d[cnt++] = CStringGetTextDatum("HEAP_XMAX_INVALID");
+	if ((t_infomask & HEAP_XMAX_IS_MULTI) != 0)
+		d[cnt++] = CStringGetTextDatum("HEAP_XMAX_IS_MULTI");
+	if ((t_infomask & HEAP_UPDATED) != 0)
+		d[cnt++] = CStringGetTextDatum("HEAP_UPDATED");
+
 	if (decode_combined && (t_infomask & HEAP_MOVED) != 0)
 		d[cnt++] = CStringGetTextDatum("HEAP_MOVED");
 	else
 	{
-		if ((t_infomask & HEAP_MOVED_IN) != 0)
-			d[cnt++] = CStringGetTextDatum("HEAP_MOVED_IN");
 		if ((t_infomask & HEAP_MOVED_OFF) != 0)
 			d[cnt++] = CStringGetTextDatum("HEAP_MOVED_OFF");
-	}
-
-	if (decode_combined && HEAP_LOCKED_UPGRADED(t_infomask))
-		d[cnt++] = CStringGetTextDatum("HEAP_LOCKED_UPGRADED");
-	else
-	{
-		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
-			d[cnt++] = CStringGetTextDatum("HEAP_XMAX_LOCK_ONLY");
-		if ((t_infomask & HEAP_XMAX_IS_MULTI) != 0)
-			d[cnt++] = CStringGetTextDatum("HEAP_XMAX_IS_MULTI");
+		if ((t_infomask & HEAP_MOVED_IN) != 0)
+			d[cnt++] = CStringGetTextDatum("HEAP_MOVED_IN");
 	}
 
 	/* decode t_infomask2 */
diff --git a/contrib/pageinspect/sql/page.sql b/contrib/pageinspect/sql/page.sql
index
Re: Create collation reporting the ICU locale display name
Michael Paquier writes:
> On Thu, Sep 12, 2019 at 03:03:43PM -0400, Tom Lane wrote:
>> The idea of having CREATE COLLATION automatically create a comment
>> is sort of interesting, although it seems pretty orthogonal to
>> normal command behavior. I wonder whether the seeming need for
>> this indicates that we should add a descriptive field to pg_collation
>> proper, and not usurp the user-oriented comment feature for that.
>>
>> The difficulty with localization is that whatever we put into
>> template1 has got to be ASCII-only, so that the template DB
>> can be copied to other encodings. I suppose we could consider
>> having CREATE COLLATION act differently during initdb than
>> later, but that seems ugly too.

> Or could it make sense to provide a system function which returns a
> collation description for at least an ICU-provided one? We could make
> use of that in psql for example.

Oh, that seems like a good way to tackle it.

			regards, tom lane
Re: refactoring - share str2*int64 functions
On Tue, Sep 10, 2019 at 12:05:25PM +0900, Michael Paquier wrote:
> Attached is an updated patch. How does it look? I have left the
> parts of readfuncs.c alone for now, as there are more issues behind
> that than a single switch can address: short reads are one, long reads
> a second. And the patch already does a lot. There could also be an
> argument for having extra _check wrappers for the unsigned portions,
> but these would be mostly unused in the backend code, so I have left
> that out on purpose.

I have looked at this patch again today after setting it aside for a
couple of days, and I quite like the resulting shape of the routines.
Does anybody else have any comments? Would it make sense to further
extend the string-to-int conversion routines with a set of control flags
to bypass the removal of leading and trailing whitespace?
--
Michael

signature.asc
Description: PGP signature
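For context on the whitespace question: the existing string-to-integer
conversion paths skip surrounding whitespace today, and the proposed flags
would let callers opt out of that behavior. A quick illustration:

    -- Leading and trailing whitespace is currently accepted and skipped
    -- when parsing integer input.
    SELECT '  42  '::int8;   -- returns 42
    SELECT ' -7 '::int4;     -- returns -7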
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Fri, Sep 13, 2019 at 9:00 AM Michael Paquier wrote:
>
> On Thu, Sep 12, 2019 at 05:24:17PM +0530, Amit Kapila wrote:
> > On Thu, Sep 12, 2019 at 4:48 PM Michael Paquier wrote:
> > Hmm, I thought that setting the decode_combined flag to false meant we
> > would display the raw flags set on the tuple without any further
> > interpretation. I think that is what most people in the thread
> > advocated.
>
> Sorry if I created any confusion. When set to false the raw list
> of flags is returned, and that's the default.

I think that is what we have not done in one of the cases I pointed out.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Berserk Autovacuum (let's save next Mandrill)
On Fri, Sep 13, 2019 at 8:52 AM Masahiko Sawada wrote:
> On Tue, Sep 10, 2019 at 8:19 PM Amit Kapila wrote:
> >
> > Generally speaking, having more GUCs for autovacuum, and ones that are
> > in some way dependent on existing GUCs, sounds a bit scary,
> > but OTOH whatever you wrote makes sense and can help the scenarios
> > which this thread is trying to deal with. Have you given any thought
> > to what Alvaro mentioned up-thread: "certain tables would have some
> > sort of partial scan that sets the visibility map. There's no reason
> > to invoke the whole vacuuming machinery. I don't think this is
> > limited to append-only tables, but rather those are just the ones
> > that are affected the most."?
>
> Speaking of partial scans, I've considered before that we could use WAL
> to find which pages have much garbage, instead of scanning all
> not-all-visible pages. We could vacuum only the particular part of the
> table where garbage collection is most effective, instead of the whole
> table. I shared some results of that at PGCon; it's still in PoC state.
>
> Also, to address the issue of updating the VM of mostly-append-only
> tables I considered some possible solutions:
>
> 1. With INDEX_CLEANUP = false and TRUNCATE = false, vacuum does HOT
> pruning, vacuums the table and updates the VM. In addition to updating
> the VM we need to do the other two operations, but since
> mostly-insert-only tables would have less garbage, the HOT pruning and
> table vacuuming should be a light workload. This is what I proposed
> up-thread.

Yes, this is an option, but it might be better if we can somehow avoid
triggering the vacuum machinery.

> 2. This may have already been discussed before, but we could update the
> VM when HOT pruning during a SELECT operation. Since this affects SELECT
> performance it should be enabled on only particular tables by user
> request.

Yeah, doing anything additional in SELECTs can be tricky; consider a case
where there is actually nothing to prune on the page -- if we still run
the visibility checks and then mark the visibility map, it can be a
noticeable overhead. OTOH, I think this will be a one-time overhead,
because after the first scan the visibility map will be updated, and
future scans don't need to update the visibility map unless someone has
updated that page.

I was wondering why not do this during write workloads. For example, when
an Insert operation finds that there is no space on the current page and
it has to move to the next page, it can check whether the page (that
doesn't have space to accommodate the current tuple) can be marked
all-visible. In this case, we would have already done the costly part of
the operation, which is to Read/Lock the buffer.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
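For anyone who wants to observe the behavior being discussed, the
pg_visibility extension can show which heap pages of a table already have
their visibility map bits set (the table name below is illustrative):

    CREATE EXTENSION IF NOT EXISTS pg_visibility;

    -- One row per heap page; all_visible/all_frozen reflect the VM bits
    -- that the proposals above aim to get set without a full vacuum.
    SELECT blkno, all_visible, all_frozen
    FROM pg_visibility_map('append_only_log')
    WHERE NOT all_visible
    LIMIT 10;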
Re: pgbench - allow to create partitioned tables
On Wed, Sep 11, 2019 at 6:08 PM Fabien COELHO wrote:
> Attached an updated version.

I have reviewed the patch and done some basic testing. It works as per
the expectation. I have a few cosmetic comments.

1.
+	if (partitions >= 1)
+	{
+		char		ff[64];
+		ff[0] = '\0';
+		append_fillfactor(ff, sizeof(ff));

Generally, we give one blank line between the variable declaration and
the first statement of the block.

2.
+		if (p == 1)
+			sprintf(minvalue, "minvalue");
+		else
+			sprintf(minvalue, INT64_FORMAT, (p-1) * part_size + 1);

(p-1) -> (p - 1)

I am just wondering, will it be a good idea to expand this to support
multi-level partitioning?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
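For readers not following the patch itself, the DDL issued by the patched
pgbench -i --partitions=N looks roughly like the sketch below. This is
illustrative only; the column list is the standard pgbench_accounts layout
and the boundary values depend on the scale factor:

    CREATE TABLE pgbench_accounts (
        aid      integer NOT NULL,
        bid      integer,
        abalance integer,
        filler   character(84)
    ) PARTITION BY RANGE (aid);

    CREATE TABLE pgbench_accounts_1 PARTITION OF pgbench_accounts
        FOR VALUES FROM (MINVALUE) TO (50001) WITH (fillfactor = 100);
    CREATE TABLE pgbench_accounts_2 PARTITION OF pgbench_accounts
        FOR VALUES FROM (50001) TO (MAXVALUE) WITH (fillfactor = 100);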
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Thu, Sep 12, 2019 at 05:24:17PM +0530, Amit Kapila wrote:
> On Thu, Sep 12, 2019 at 4:48 PM Michael Paquier wrote:
> Hmm, I thought that setting the decode_combined flag to false meant we
> would display the raw flags set on the tuple without any further
> interpretation. I think that is what most people in the thread
> advocated.

Sorry if I created any confusion. When set to false the raw list of
flags is returned, and that's the default. The example provided in the
docs is careful about that, as is the description of the option (at
least I think so!).

> Yes, I think we could have more discussion on this point. It is not
> 100% clear how we should interpret this flag and/or where to draw the
> line. It might be that whatever we have done is alright, but still,
> it is worth more discussion and opinions from a few more people.

Of course.

>> decode_combined sounds like a good compromise to me. If there is a
>> better consensus, well, let's use it, but I don't find those
>> suggestions to be improvements.
>
> I think it depends on the meaning of that flag.

Perhaps using "decode" is the confusing part here? It is more like a
"merge" of the flags, or just a combination of them. An idea that just
popped up here would be to name the switch "combine_flags" instead.
--
Michael

signature.asc
Description: PGP signature
Re: Create collation reporting the ICU locale display name
On Thu, Sep 12, 2019 at 03:03:43PM -0400, Tom Lane wrote:
> The idea of having CREATE COLLATION automatically create a comment
> is sort of interesting, although it seems pretty orthogonal to
> normal command behavior. I wonder whether the seeming need for
> this indicates that we should add a descriptive field to pg_collation
> proper, and not usurp the user-oriented comment feature for that.
>
> The difficulty with localization is that whatever we put into
> template1 has got to be ASCII-only, so that the template DB
> can be copied to other encodings. I suppose we could consider
> having CREATE COLLATION act differently during initdb than
> later, but that seems ugly too.

Or could it make sense to provide a system function which returns a
collation description for at least an ICU-provided one? We could make
use of that in psql for example.
--
Michael

signature.asc
Description: PGP signature
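To make the two ideas above concrete: today a display name is attached only
to collations created by pg_import_system_collations(), where it is stored
as an ordinary comment and can be queried like this:

    SELECT c.collname,
           obj_description(c.oid, 'pg_collation') AS display_name
    FROM pg_collation c
    WHERE c.collprovider = 'i'     -- ICU-provided collations
    LIMIT 5;

The proposed system function might then look something like the following;
the function name and output here are purely hypothetical, shown only to
illustrate the idea:

    SELECT pg_icu_collation_display_name('fr-CA-u-kn-true');
    -- e.g. "French (Canada, numeric sorting)"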
Re: psql - improve test coverage from 41% to 88%
On Thu, Sep 12, 2019 at 12:14:16PM -0300, Alvaro Herrera wrote:
> Mostly, because I think they're going to cause trouble. Adding a
> parameter in the middle of the list may cause trouble for third-party
> users of TestLib. I propose that we make the routines a bit smarter to
> cope with the API change: use named parameters instead. And in order to
> do that without having to change existing users of command_check, make
> it so that the routine checks whether the parameter is a hashref, and
> behaves differently. So when called as in the existing callsites (five
> scalar parameters) it behaves as currently.

+1.
--
Michael

signature.asc
Description: PGP signature
[bug fix??] Fishy code in tts_virtual_copyslot()
Hello,

In the following code in execTuples.c, shouldn't srcdesc point to the
source slot's tuple descriptor? The attached fix passes make check.
What kind of failure could this cause?

BTW, I thought that in PostgreSQL coding convention, local variables
should be defined at the top of blocks, but this function writes
"for (int natt = 0; ...)". I didn't modify it because many other source
files also write in that way.

--
static void
tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	/* XXX fishy: presumably this should be srcslot->tts_tupleDescriptor */
	TupleDesc	srcdesc = dstslot->tts_tupleDescriptor;

	Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);

	tts_virtual_clear(dstslot);

	slot_getallattrs(srcslot);

	for (int natt = 0; natt < srcdesc->natts; natt++)
	{
		dstslot->tts_values[natt] = srcslot->tts_values[natt];
		dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
	}
--

Regards
Takayuki Tsunakawa

virtslot_tiny_fix.patch
Description: virtslot_tiny_fix.patch
Re: Leakproofness of texteq()/textne()
Robert Haas writes:
> I wouldn't feel comfortable with ignoring information leaks that can
> happen with some valid strings but not others. That sounds like
> exactly the sort of information leak that we must prevent. The user
> can write arbitrary stuff in their query, potentially transforming
> strings so that the result hits the ERROR iff the original string had
> some arbitrary property P for which they wish to test.

Agreed.

> However, I wonder if there's any realistic case outside of an encoding
> conversion where such failures can occur. I would expect, perhaps
> naively, that the set of characters that can be represented by UTF-16
> is the same set as can be represented by UTF-8.

Unless Microsoft did something weird, that doesn't seem very likely.

I'm more worried about the possibility that ICU contains weird exception
cases that will make it fail to compare specific strings. Noting that
ucnv_toUChars has an error indicator but ucol_strcoll does not, it seems
like again any such cases are going to boil down to encoding conversion
problems.

However, if there is some character C that makes ICU misbehave like
that, we are going to have problems with indexing strings containing C,
whether we think varstr_cmp is leaky or not. So I'm not sure that
focusing our attention on leakiness is a helpful way to proceed.

			regards, tom lane
Re: Leakproofness of texteq()/textne()
On Thu, Sep 12, 2019 at 1:38 PM Tom Lane wrote:
> In any case, from a purely theoretical viewpoint, such an error message
> *does* constitute a leak of information about the input strings. Whether
> it's a usable leak is very debatable, but that's basically what we've
> got to decide.

I'm pretty content to ignore information leaks that can only happen if
the database is corrupt anyway. If that's moving the goalposts at all,
it's about a quarter-inch. I mean, a slightly differently corrupted
varlena could crash the database entirely.

I wouldn't feel comfortable with ignoring information leaks that can
happen with some valid strings but not others. That sounds like exactly
the sort of information leak that we must prevent. The user can write
arbitrary stuff in their query, potentially transforming strings so that
the result hits the ERROR iff the original string had some arbitrary
property P for which they wish to test. Allowing that sounds no
different than deciding that int4div is leakproof, which it sure isn't.

However, I wonder if there's any realistic case outside of an encoding
conversion where such failures can occur. I would expect, perhaps
naively, that the set of characters that can be represented by UTF-16 is
the same set as can be represented by UTF-8.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
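To make the class of attack concrete (the table and column names here are
hypothetical), with an error-throwing function such as int4div, a predicate
pushed down past a security barrier can be used as an oracle:

    -- If this predicate were evaluated before the view's security checks,
    -- the query would error out exactly when some hidden row satisfies
    -- length(secret) = 5, leaking that fact through the error itself.
    -- This is why int4div, which errors on division by zero, is not
    -- leakproof.
    SELECT *
    FROM protected_view
    WHERE 1 / (length(secret) - 5) > 0;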
abort-time portal cleanup
I've been studying At{Sub,}{Abort,Cleanup}_Portals() for the last few
days and have come to the conclusion that the code is not entirely up to
our usual standards. I believe that a good deal of the reason for this
is attributable to the poor quality of the code comments in this area,
although there are perhaps some other contributing factors as well, such
as bullheadedness on my part and perhaps others.

The trouble starts with the header comment for AtAbort_Portals, which
states that "At this point we run the cleanup hook if present, but we
can't release the portal's memory until the cleanup call." At the time
this logic was introduced in commit
de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02), AtAbort_Portals()
affected all non-held portals without caring whether they were active or
not, and UserAbortTransactionBlock() called AbortTransaction() directly,
so typing "ROLLBACK;" would cause AtAbort_Portals() to be reached from
within PortalRun(). Even if PortalRun() managed to return without
crashing, the caller would next try to call PortalDrop() on what was now
an invalid pointer. However, commit
8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed things so
that UserAbortTransactionBlock() just sets things up so that the
subsequent call to CommitTransactionCommand() would call
AbortTransaction() instead of trying to do it right away, and that
doesn't happen until after we're done with the portal. As far as I can
see, that change made this comment mostly false, but the comment has
nevertheless managed to survive for another ~15 years. I think we can,
and in fact should, just drop the portal right here.

As far as actually making that work, there are a few wrinkles. The
first is that we might be in the middle of FATAL. In that case, unlike
the ROLLBACK case, a call to PortalRun() is still on the stack, but
we'll exit the process rather than returning, so the fact that we've
created a dangling pointer for the caller won't matter. However, as
shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01)
and the report that led up to it at
https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de
it's not a good idea to try to clean up the portal in that case, because
we might've already started shutting down critical systems. It seems
not only risky but also unnecessary: our process-local state is about to
go away, and the executor shouldn't need to clean up any shared memory
state that won't also get cleaned up by some other mechanism. So, it
seems to me that if we reach AtAbort_Portals() during FATAL processing,
we should either (1) do nothing at all and just return or (2) forget
about all the existing portals without cleaning them up and then
return. The second option seems a little safer to me, because it
guarantees that if we somehow reach code that might try to look up a
portal later, it won't find anything. But I think it's arguable.

The second wrinkle is that there might be an active portal. Apart from
the FATAL case already mentioned, I think the only way this can happen
is some C code that purposefully calls AbortTransaction() in the middle
of executing a command. It can't be an ERROR, because then the portal
would be marked as failed before we get here, and it can't be an
explicit ROLLBACK, because as noted above, that case was changed 15
years ago. It's got to be some other case where C code calls
AbortTransaction() voluntarily in the middle of a statement.

For over a decade, there were no cases at all of this type, but the code
in this function catered to hypothetical cases by marking the portal
failed. By 2016, Noah had figured out that this was bogus, and that any
future cases would likely require different handling, but Tom and I
shouted him down:

http://postgr.es/m/67674.1454259...@sss.pgh.pa.us

The end result of that discussion was commit
41baee7a9312eefb315b6b2973ac058c9efaa9cf (2016-02-05), which left the
code as it was but added comments noting that it was wrong. It actually
wasn't entirely wrong, because it handled the FATAL case mentioned above
by the byzantine mechanism of invoking the portal's cleanup callback
after first setting the status to PORTAL_FAILED. Since the only
existing cleanup callback arranges to do nothing if the status is
PORTAL_FAILED, this worked out to a complicated way of (correctly)
skipping the callback in the FATAL case. But, probably because that
wasn't documented in any understandable way, and possibly because nobody
really understood it, when commit
8561e4840c81f7e345be2df170839846814fa004 (2018-01-22) added support for
transaction control in procedures, it just removed the code marking
active portals as failed, just as Noah had predicted would be necessary
~2 years earlier. Andres noticed that this broke the FATAL case and
tracked it back to the removal of this code, resulting in it getting put
back, but just for the FATAL case, in commit
ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01). See
Re: Fix XML handling with DOCTYPE
Hi Chapman,

On 2019-Sep-05, Chapman Flack wrote:
> Are these on your radar to maybe backpatch in this round of activity?
>
> The latest patches I did for 11 and 10 are in
> https://www.postgresql.org/message-id/5D45A44F.8010803%40anastigmatix.net

Thanks! I just pushed these to those branches. I think we're finally
done with these. Many thanks for your persistence.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: psql - add SHOW_ALL_RESULTS option
This v6 is just Fabien's v5, rebased over a very minor conflict, and
pgindented. No further changes. I've marked this Ready for Committer.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 6787ec1efd..de59a5cf65 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -49,17 +49,42 @@ BEGIN \;
 SELECT 2.0 AS "float" \;
 SELECT 'world' AS "text" \;
 COMMIT;
+ float
+-------
+   2.0
+(1 row)
+
+ text
+-------
+ world
+(1 row)
+
 -- compound with empty statements and spurious leading spacing
-\;\;   SELECT 3 + 3 \;\;\;   SELECT ' ' || ' !' \;\;   SELECT 1 + 4 \;;
- ?column?
-----------
-        5
+\;\;   SELECT 3 + 3 AS "+" \;\;\;   SELECT ' ' || ' !' AS "||" \;\;   SELECT 1 + 4 AS "+" \;;
+ +
+---
+ 6
+(1 row)
+
+ ||
+-----
+   !
+(1 row)
+
+ +
+---
+ 5
 (1 row)
 
 -- non ;-terminated statements
 SELECT 1 + 1 + 1 AS "add" \gset
 SELECT :add + 1 + 1 AS "add" \;
 SELECT :add + 1 + 1 AS "add" \gset
+ add
+-----
+   5
+(1 row)
+
 -- set operator
 SELECT 1 AS i UNION SELECT 2 ORDER BY i;
  i
@@ -102,12 +127,12 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
  SELECT $1                                     +| 4 |    4
                                                +|   |
  AS "text"                                      |   |
- SELECT $1 + $2                                 | 2 |    2
  SELECT $1 + $2 + $3 AS "add"                   | 3 |    3
+ SELECT $1 + $2 AS "+"                          | 2 |    2
  SELECT $1 AS "float"                           | 1 |    1
  SELECT $1 AS "int"                             | 2 |    2
  SELECT $1 AS i UNION SELECT $2 ORDER BY i      | 1 |    2
- SELECT $1 || $2                                | 1 |    1
+ SELECT $1 || $2 AS "||"                        | 1 |    1
  SELECT pg_stat_statements_reset()              | 1 |    1
  WITH t(f) AS (                                +| 1 |    2
    VALUES ($1), ($2)                           +|   |
diff --git a/contrib/pg_stat_statements/sql/pg_stat_statements.sql b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
index 8b527070d4..ea3a31176e 100644
--- a/contrib/pg_stat_statements/sql/pg_stat_statements.sql
+++ b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
@@ -27,7 +27,7 @@ SELECT 'world' AS "text" \;
 COMMIT;
 
 -- compound with empty statements and spurious leading spacing
-\;\;   SELECT 3 + 3 \;\;\;   SELECT ' ' || ' !' \;\;   SELECT 1 + 4 \;;
+\;\;   SELECT 3 + 3 AS "+" \;\;\;   SELECT ' ' || ' !' AS "||" \;\;   SELECT 1 + 4 AS "+" \;;
 
 -- non ;-terminated statements
 SELECT 1 + 1 + 1 AS "add" \gset
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 7789fc6177..4e6ab5a0a5 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -127,18 +127,11 @@ echo '\x \\ SELECT * FROM foo;' | psql
 commands included in the string to divide it into multiple
 transactions.  (See for more details about how the server handles
 multi-query strings.)
-Also, psql only prints the
-result of the last SQL command in the string.
-This is different from the behavior when the same string is read from
-a file or fed to psql's standard input,
-because then psql sends
-each SQL command separately.
-Because of this behavior, putting more than one SQL command in a
-single -c string often has unexpected results.
-It's better to use repeated -c commands or feed
-multiple commands to psql's standard input,
+If having several commands executed in one transaction is not desired,
+use repeated -c commands or feed multiple commands to
+psql's standard input,
 either using echo as illustrated above, or
 via a shell here-document, for example:
@@ -3355,10 +3348,8 @@ select 1\; select 2\; select 3;
 commands included in the string to divide it into multiple
 transactions.  (See for more details about how the server handles
 multi-query strings.)
-psql prints only the last query result
-it receives for each request; in this example, although all
-three SELECTs are indeed executed, psql
-only prints the 3.
+psql prints all results it receives, one
+after the other.
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 4b2679360f..fa358c8c58 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -486,6 +486,16 @@ ResetCancelConn(void)
 #endif
 }
 
+static void
+ShowErrorMessage(const PGresult *result)
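In a nutshell, the user-visible change the patch makes to compound
statements:

    -- Sent from psql as one compound statement:
    SELECT 1 AS a \; SELECT 2 AS b;
    -- unpatched psql: prints only the result of SELECT 2 AS b
    -- patched psql:   prints both results, one after the other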
Re: Create collation reporting the ICU locale display name
Alvaro Herrera writes:
> I wonder if INFO is better than NOTICE (I think it is).

You're just waving a red flag in front of a bull, you know. I don't
especially like the idea of having this emit a NOTICE; it's ugly and
in-your-face. INFO is right out.

The idea of having CREATE COLLATION automatically create a comment is
sort of interesting, although it seems pretty orthogonal to normal
command behavior. I wonder whether the seeming need for this indicates
that we should add a descriptive field to pg_collation proper, and not
usurp the user-oriented comment feature for that.

The difficulty with localization is that whatever we put into template1
has got to be ASCII-only, so that the template DB can be copied to other
encodings. I suppose we could consider having CREATE COLLATION act
differently during initdb than later, but that seems ugly too.

			regards, tom lane
Re: another look at macOS SIP
On 2019-09-10 19:26, Tom Lane wrote:
>> I think the way forward here is to get rid of all uses of system() for
>> calling between PostgreSQL programs.
>
> We could do that perhaps, but how are you going to get make to not use
> /bin/sh while spawning subprocesses? I don't think we want to also
> reimplement make ...

make is not a problem if the DYLD_* assignments are in a makefile rule
(as currently), because then make just calls a shell with a string
"DYLD_*=foo some command", which is not affected by any filtering. It
would be a problem if you do the variable assignments in a makefile
outside a rule or outside a makefile.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pglz performance
I just noticed we had two CF items pointing to this thread,

https://commitfest.postgresql.org/24/2119/
https://commitfest.postgresql.org/24/2180/

so I marked the newer one as withdrawn.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Create collation reporting the ICU locale display name
On 2019-Sep-12, Daniel Verite wrote:
> Michael Paquier wrote:
>
> > On Wed, Sep 11, 2019 at 04:53:16PM +0200, Daniel Verite wrote:
> > > I think it would be nice to have CREATE COLLATION report this
> > > information as feedback in the form of a NOTICE message.
> > > PFA a simple patch implementing that.
> >
> > Why is that better than the descriptions provided with \dO[S]+ in
> > psql?
>
> There is no description for collations created outside of
> pg_import_system_collations().

Hmm, sounds like the collation should automatically acquire the display
name as a comment even when created via CREATE COLLATION. I wonder if
INFO is better than NOTICE (I think it is).

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Create collation reporting the ICU locale display name
On Thu, Sep 12, 2019 at 11:30 AM Peter Geoghegan wrote:
> I wonder if it's possible to display a localized version of the
> display string in the NOTICE message? Does that work, or could it? For
> example, do you see the message in French?

BTW, I already know for sure that ICU supports localized display names.
The question is whether or not this patch can take advantage of that.
The way that we use display names in pg_import_system_collations() is an
ugly hack. It insists on only storing ASCII-safe strings in pg_collation.

--
Peter Geoghegan
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
Hi,

On 2019-09-12 09:57:55 -0300, Alvaro Herrera wrote:
> On 2019-Sep-12, Tomas Vondra wrote:
> > On Wed, Sep 11, 2019 at 09:51:40AM -0300, Alvaro Herrera from 2ndQuadrant wrote:
> > > On 2019-Sep-11, Amit Khandekar wrote:
> > > I think doing this all the time would make restore very slow -- there's a
> > > reason we keep the files open, after all.
> >
> > How much slower? It certainly will have a hit, but maybe it's negligible
> > compared to all the other stuff happening in this code?

I'd expect it to be significant.

> > As a sidenote - in the other thread about streaming, one of the patches
> > does change how we log subxact assignments. In the end, this allows using
> > just a single file for the top-level transaction, instead of having one
> > file per subxact. That would also solve this.

Uhm, how is rollback to savepoint going to be handled in that case? I
don't think it's great to just retain space for all rolled back
subtransactions.

Greetings,

Andres Freund
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
Hi,

On 2019-09-12 09:41:02 -0400, Robert Haas wrote:
> On Thu, Sep 12, 2019 at 5:31 AM Tomas Vondra wrote:
> > I don't see how the current API could do that transparently - it does
> > track the files, but the user only gets a file descriptor. With just a
> > file descriptor, how could the code know to do reopen/seek when it's going
> > just through the regular fopen/fclose?
> >
> > Anyway, I agree we need to do something, to fix this corner case (many
> > serialized in-progress transactions). ISTM we have two options - either do
> > something in the context of reorderbuffer.c, or extend the transient file
> > API somehow. I'd say the second option is the right thing going forward,
> > because it does allow doing it transparently and without leaking details
> > about maxAllocatedDescs etc. There are two issues, though - it does
> > require changes / extensions to the API, and it's not backpatchable.
> >
> > So maybe we should start with the localized fix in reorderbuffer, and I
> > agree tracking offset seems reasonable.
>
> We've already got code that knows how to track this sort of thing.
> You just need to go through the File abstraction (PathNameOpenFile or
> PathNameOpenFilePerm or OpenTemporaryFile) rather than using the
> functions that deal directly with fds (OpenTransientFile,
> BasicOpenFile, etc.). It seems like it would be better to reuse the
> existing VFD layer than to invent a whole new one specific to logical
> replication.

Yea, I agree that that is the right fix.

Greetings,

Andres Freund
Re: Create collation reporting the ICU locale display name
On Wed, Sep 11, 2019 at 7:53 AM Daniel Verite wrote:
> The 'locale' or 'lc_collate/lc_ctype' argument of an ICU collation may
> have a complicated syntax, especially with non-deterministic
> collations, and input mistakes in these names will not necessarily be
> detected as such by ICU.

That's a real problem.

> The "display name" of a locale is a simple way to get human-readable
> feedback about the characteristics of that locale.
> pg_import_system_collations() already pushes these as comments when
> creating locales en masse.
>
> I think it would be nice to have CREATE COLLATION report this
> information as feedback in the form of a NOTICE message.
> PFA a simple patch implementing that.

I like this idea.

I wonder if it's possible to display a localized version of the display
string in the NOTICE message? Does that work, or could it? For example,
do you see the message in French?

--
Peter Geoghegan
Re: Leakproofness of texteq()/textne()
I wrote:
> I agree with your point that this is a shouldn't-happen corner case.
> The question boils down to, if it *does* happen, does that constitute
> a meaningful information leak? Up to now we've taken quite a hard
> line about what leakproofness means, so deciding that varstr_cmp
> is leakproof would constitute moving the goalposts a bit. They'd
> still be in the same stadium, though, IMO.

For most of us it might be more meaningful to look at the non-Windows
code paths, for which the question reduces to what we think of this:

	UErrorCode	status;

	status = U_ZERO_ERROR;
	result = ucol_strcollUTF8(mylocale->info.icu.ucol,
							  arg1, len1,
							  arg2, len2,
							  &status);
	if (U_FAILURE(status))
		ereport(ERROR,
				(errmsg("collation failed: %s", u_errorName(status))));

which, as it happens, is also a UTF8-encoding-only code path. Can this
throw an error in practice, and if so does that constitute a meaningful
information leak? (For bonus points: is this error report up to project
standards?)

Thumbing through the list of UErrorCode values, it seems like the only
ones that are applicable here and aren't internal-error cases are
U_INVALID_CHAR_FOUND and the like, so that this boils down to "one of
the strings contains a character that ICU can't cope with". Maybe
that's impossible except with a pre-existing encoding violation, or
maybe not.

In any case, from a purely theoretical viewpoint, such an error message
*does* constitute a leak of information about the input strings.
Whether it's a usable leak is very debatable, but that's basically what
we've got to decide.

			regards, tom lane
Re: Leakproofness of texteq()/textne()
Robert Haas writes:
> On Thu, Sep 12, 2019 at 12:19 PM Tom Lane wrote:
>> After burrowing down further, it's visibly the case that
>> text_cmp and varstr_cmp don't leak in the sense of actually
>> reporting any part of their input strings. What they do do,
>> in some code paths, is things like
>>	ereport(ERROR,
>>			(errmsg("could not convert string to UTF-16: error code %lu",
>>					GetLastError())));

> Is this possible? I mean, I'm sure it could happen if the data's
> corrupted, but we ought to have validated it on the way into the
> database. But maybe this code path also gets used for non-Unicode
> encodings?

Nope, the above is inside

	#ifdef WIN32
		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
		if (GetDatabaseEncoding() == PG_UTF8
			&& (!mylocale || mylocale->provider == COLLPROVIDER_LIBC))

I agree with your point that this is a shouldn't-happen corner case.
The question boils down to, if it *does* happen, does that constitute a
meaningful information leak? Up to now we've taken quite a hard line
about what leakproofness means, so deciding that varstr_cmp is leakproof
would constitute moving the goalposts a bit. They'd still be in the
same stadium, though, IMO.

Another approach would be to try to remove these failure cases, but I
don't really see how we'd do that.

			regards, tom lane
Re: Leakproofness of texteq()/textne()
On Thu, Sep 12, 2019 at 12:19 PM Tom Lane wrote:
> After burrowing down further, it's visibly the case that
> text_cmp and varstr_cmp don't leak in the sense of actually
> reporting any part of their input strings. What they do do,
> in some code paths, is things like
>
>	ereport(ERROR,
>			(errmsg("could not convert string to UTF-16: error code %lu",
>					GetLastError())));

Is this possible? I mean, I'm sure it could happen if the data's
corrupted, but we ought to have validated it on the way into the
database. But maybe this code path also gets used for non-Unicode
encodings?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Leakproofness of texteq()/textne()
I wrote:
> Seems like we have two choices:
> 1. Remove the leakproof marking on texteq()/textne(). This
> would, unfortunately, be catastrophic for performance in
> a lot of scenarios where leakproofness matters.
> 2. Fix text_cmp to actually be leakproof, whereupon we might
> as well mark all the remaining btree comparison operators
> leakproof.
> I think #2 is pretty obviously the superior answer, if we
> can do it.

After burrowing down further, it's visibly the case that text_cmp and
varstr_cmp don't leak in the sense of actually reporting any part of
their input strings. What they do do, in some code paths, is things
like

	ereport(ERROR,
			(errmsg("could not convert string to UTF-16: error code %lu",
					GetLastError())));

So this seems to mostly be a question of interpretation: does an error
like this represent enough of an information leak to violate the
promises of leakproofness? I'm not sure that this error is really
capable of disclosing any information about the input strings. (Note
that this error occurs only if we failed to convert UTF8 to UTF16, which
probably could only happen if the input isn't valid UTF8, which would
mean a failure of encoding checking somewhere up the line.)

There are also various pallocs and such that could fail, which perhaps
could be argued to disclose the lengths of the input strings. But that
hazard exists already inside PG_GETARG_TEXT_PP; if you want to complain
about that, then functions like byteaeq() aren't legitimately leakproof
either.

On the whole I'm prepared to say that these aren't leakproofness
violations, but it would depend a lot on how strict you want to be.

			regards, tom lane
Re: Pulling up direct-correlated ANY_SUBLINK
Richard Guo wrote:

> On Wed, Sep 11, 2019 at 3:25 PM Antonin Houska wrote:
>
> > Nevertheless, I don't know how to overcome the problems that I
> > mentioned upthread.
>
> Do you mean the problem "the WHERE clause of the subquery didn't
> participate in the SEMI JOIN evaluation"? Good news is it has been
> fixed by commit 043f6ff0 as I mentioned upthread.

Do you say that my old patch (rebased) no longer breaks the regression
tests? (I noticed your other email in the thread which seems to indicate
that you're no longer interested in working on the feature, but asking
out of curiosity.)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
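For context, a minimal example of the query shape under discussion (the
table names are illustrative): an ANY sublink whose subquery references the
outer query directly, which the planner currently refuses to pull up into a
semijoin:

    -- The inner WHERE clause references the outer relation "a" directly,
    -- so the sublink pull-up path (convert_ANY_sublink_to_join) currently
    -- gives up and leaves this as a correlated subplan.
    SELECT *
    FROM a
    WHERE a.x = ANY (SELECT b.y
                     FROM b
                     WHERE b.z = a.z);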
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
> OK, so we have that now. I suppose this spreadsheet now tells us which
> places are useful and which aren't, at least for the queries that you've
> tested. Does that mean that we want to get the patch to consider adding
> paths only in the places that your spreadsheet says are useful? I'm not
> sure what the next steps are for this patch.

I wanted to note here that I haven't abandoned this patch, but ended up
needing to use my extra time for working on a conference talk. That talk
is today, so I'm hoping to be able to catch up on this again soon.

James Coleman
JSON parser discards value of string token
Although the following problem does not seem to be exposed in the core, I
think it's still a problem worth fixing. (I hit it when implementing a
custom parser for an extension configuration file.)

If makeJsonLexContextCstringLen() is passed need_escapes=false,
JsonLexContext.strval is not initialized, and in turn, the functions of
JsonSemAction which should receive the string token value (e.g.
object_field_start) receive NULL.

Attached is a patch that fixes the problem. If this approach is
acceptable, then it'd probably also be worth renaming the
JsonLexContext.strval field to something that recalls the "de-escaping",
e.g. "noesc"?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 26d293709a..2ef16fb089 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -142,17 +142,26 @@ lex_accept(JsonLexContext *lex, JsonTokenType token, char **lexeme)
 	{
 		if (lexeme != NULL)
 		{
-			if (lex->token_type == JSON_TOKEN_STRING)
-			{
-				if (lex->strval != NULL)
-					*lexeme = pstrdup(lex->strval->data);
-			}
+			if (lex->token_type == JSON_TOKEN_STRING && lex->strval != NULL)
+				*lexeme = pstrdup(lex->strval->data);
 			else
 			{
 				int			len = (lex->token_terminator - lex->token_start);
-				char	   *tokstr = palloc(len + 1);
+				char	   *src = lex->token_start;
+				char	   *tokstr;
+
+				/* String token should be quoted. */
+				if (lex->token_type == JSON_TOKEN_STRING)
+				{
+					Assert(len >= 2);
+					Assert(src[0] == '"' && src[len - 1] == '"');
+
+					src++;
+					len -= 2;
+				}
 
-				memcpy(tokstr, lex->token_start, len);
+				tokstr = palloc(len + 1);
+				memcpy(tokstr, src, len);
 				tokstr[len] = '\0';
 				*lexeme = tokstr;
 			}
Leakproofness of texteq()/textne()
Currently, texteq() and textne() are marked leakproof, while sibling
operations such as textlt() are not. The argument for that was that
those two functions depend only on memcmp(), so they can be seen to be
safe, whereas it's a whole lot less clear that text_cmp() should be
considered leakproof.

Of course, the addition of nondeterministic collations has utterly
obliterated that argument: it's now possible for texteq() to call
text_cmp(), so if the latter is leaky then the former certainly must be
considered so as well.

Seems like we have two choices:

1. Remove the leakproof marking on texteq()/textne(). This would,
unfortunately, be catastrophic for performance in a lot of scenarios
where leakproofness matters.

2. Fix text_cmp to actually be leakproof, whereupon we might as well
mark all the remaining btree comparison operators leakproof.

I think #2 is pretty obviously the superior answer, if we can do it.
ISTM we can't ship v12 without dealing with this one way or the other,
so I'll go add an open item.

Comments?

			regards, tom lane
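For reference, the current markings can be checked directly in the
catalogs:

    -- texteq/textne are marked leakproof today, while the ordering
    -- functions behind the btree operators are not -- which is the
    -- inconsistency at issue once texteq() can reach text_cmp().
    SELECT proname, proleakproof
    FROM pg_proc
    WHERE proname IN ('texteq', 'textne', 'text_lt', 'bttextcmp');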
Re: Implementing Incremental View Maintenance
On 2019-Aug-06, Tatsuo Ishii wrote:
> It's not mentioned below, but some bugs, including a segfault when
> --enable-cassert is used, were also fixed in this patch.
>
> BTW, I found a bug with min/max support in this patch and I believe
> Yugo is working on it. Details:
> https://github.com/sraoss/pgsql-ivm/issues/20

So is he posting an updated patch soon?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: psql - improve test coverage from 41% to 88%
I think the TestLib.pm changes should be done separately, not together
with the rest of the hacking in this patch.

Mostly, because I think they're going to cause trouble. Adding a
parameter in the middle of the list may cause trouble for third-party
users of TestLib. I propose that we make the routines a bit smarter to
cope with the API change: use named parameters instead. And in order to
do that without having to change existing users of command_check, make
it so that the routine checks whether the parameter is a hashref, and
behaves differently. So when called as in the existing callsites (five
scalar parameters) it behaves as currently.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: (Re)building index using itself or another index of the same table
Arseny Sher writes:
> A problem of similar nature can be reproduced with the following
> stripped-down scenario:

> CREATE TABLE pears(f1 int primary key, f2 int);
> INSERT INTO pears SELECT i, i+1 FROM generate_series(1, 100) i;
> CREATE OR REPLACE FUNCTION pears_f(i int) RETURNS int LANGUAGE SQL IMMUTABLE AS $$
>   SELECT f1 FROM pears WHERE pears.f2 = 42
> $$;
> CREATE index ON pears ((pears_f(f1)));

We've seen complaints about this sort of thing before, and rejected them
because, as you say, that function is NOT immutable. When you lie to
the system like that, you should not be surprised if things break.

> There is already a mechanism which prevents usage of indexes during
> reindex -- ReindexIsProcessingIndex et al. However, to the contrary of
> what the index.c:3664 comment says, these protect only indexes on
> system catalogs, not user tables: the only real caller is genam.c.

> Attached patch extends it: the same check is added to
> get_relation_info. Also SetReindexProcessing is cocked in index_create
> to defend from index self usage during creation as in the stripped
> example above. There are some other still unprotected callers of
> index_build; concurrent index creation doesn't need it because the
> index is 'not indisvalid' during the build, and in
> RelationTruncateIndexes the table is empty, so it looks like it can be
> omitted.

I have exactly no faith that this fixes things enough to make such cases
supportable. And I have no interest in opening that can of worms anyway.
I'd rather put in some code to reject database accesses in immutable
functions.

> One might argue that a function selecting from the table can hardly be
> called immutable, and immutability is required for index expressions.
> However, if the user is sure the table contents don't change, why not?

If the table contents never change, why are you doing VACUUM FULL on it?

			regards, tom lane
(Re)building index using itself or another index of the same table
Hi,

Our customer encountered a curious scenario. They have a table with a GIN
index on an expression which performs multiple joins with this table
itself. These joins employ another btree index for efficiency. VACUUM
FULL on this table fails with an error like

ERROR: could not read block 3534 in file "base/41366676/56697497": read only 0 of 8192 bytes

It happens because the order in which indexes are rebuilt is not
specified. The GIN index is being rebuilt while the btree index is not
reconstructed yet; an attempt to use the old index with the rewritten
heap crashes.

A problem of similar nature can be reproduced with the following
stripped-down scenario:

CREATE TABLE pears(f1 int primary key, f2 int);
INSERT INTO pears SELECT i, i+1 FROM generate_series(1, 100) i;
CREATE OR REPLACE FUNCTION pears_f(i int) RETURNS int LANGUAGE SQL IMMUTABLE AS $$
  SELECT f1 FROM pears WHERE pears.f2 = 42
$$;
CREATE index ON pears ((pears_f(f1)));

Here usage of the not-yet-created index on pears_f(f1) for its own
construction is pointless; however, the planner in principle considers it
in get_relation_info, tries to get the btree height (_bt_getrootheight)
-- and fails.

There is already a mechanism which prevents usage of indexes during
reindex -- ReindexIsProcessingIndex et al. However, to the contrary of
what the index.c:3664 comment says, these protect only indexes on system
catalogs, not user tables: the only real caller is genam.c.

Attached patch extends it: the same check is added to get_relation_info.
Also SetReindexProcessing is cocked in index_create to defend from index
self usage during creation as in the stripped example above. There are
some other still unprotected callers of index_build; concurrent index
creation doesn't need it because the index is 'not indisvalid' during the
build, and in RelationTruncateIndexes the table is empty, so it looks
like it can be omitted.

One might argue that a function selecting from the table can hardly be
called immutable, and immutability is required for index expressions.
However, if the user is sure the table contents don't change, why not?
Also, the possibility of triggering a "could not read block" error with
plain SQL is definitely not nice.

From 5942a3a5b2c90056119b9873c81f30dfa9e003af Mon Sep 17 00:00:00 2001
From: Arseny Sher
Date: Thu, 12 Sep 2019 17:35:16 +0300
Subject: [PATCH] Avoid touching user indexes while they are being (re)built.

Existing ReindexIsProcessingIndex check is consulted only in genam.c and
thus enforced only for system catalogs. Check it also in the planner, so
that indexes which are currently being rebuilt are never used. Also cock
SetReindexProcessing in index_create to defend from index self usage
during its creation. Without this, VACUUM FULL or just CREATE INDEX
might fail with something like

ERROR: could not read block 3534 in file "base/41366676/56697497": read only 0 of 8192 bytes

if there are indexes whose usage can be considered during these very
indexes' (re)building, i.e. an index expression scans the indexed table.

---
 src/backend/catalog/index.c                | 22 ++++++++++++++++++++--
 src/backend/optimizer/util/plancat.c       |  5 +++++
 src/test/regress/expected/create_index.out | 12 ++++++++++++
 src/test/regress/sql/create_index.sql      | 13 +++++++++++++
 4 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 3e1d40662d..5bc764ce46 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1174,7 +1174,22 @@ index_create(Relation heapRelation,
 	}
 	else
 	{
-		index_build(heapRelation, indexRelation, indexInfo, false, true);
+		/* ensure SetReindexProcessing state isn't leaked */
+		PG_TRY();
+		{
+			/* Suppress use of the target index while building it */
+			SetReindexProcessing(heapRelationId, indexRelationId);
+
+			index_build(heapRelation, indexRelation, indexInfo, false, true);
+		}
+		PG_CATCH();
+		{
+			/* Make sure flag gets cleared on error exit */
+			ResetReindexProcessing();
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+		ResetReindexProcessing();
 	}
 
 	/*
@@ -1379,7 +1394,10 @@ index_concurrently_build(Oid heapRelationId,
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
 
-	/* Now build the index */
+	/*
+	 * Now build the index.
+	 * SetReindexProcessing is not required since indisvalid is false anyway.
+	 */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
 	/* Close both the relations, but keep the locks */
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index cf1761401d..9d58cd2574 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -27,6 +27,7 @@
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
+#include "catalog/index.h"
 #include "catalog/heap.h"
 #include "catalog/pg_am.h"
 #include "catalog/pg_proc.h"
@@ -193,6 +194,10 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On 2019-Sep-12, Tomas Vondra wrote:
> On Wed, Sep 11, 2019 at 09:51:40AM -0300, Alvaro Herrera from 2ndQuadrant wrote:
> > On 2019-Sep-11, Amit Khandekar wrote:
> > I think doing this all the time would make restore very slow -- there's a
> > reason we keep the files open, after all.
>
> How much slower? It certainly will have a hit, but maybe it's negligible
> compared to all the other stuff happening in this code?

I dunno -- that's a half-assed guess based on there being many more
syscalls, and on the fact that the API is how it is in the first place.
(Andres did a lot of perf benchmarking and tweaking back when he was
writing this ... I just point out that I have a colleague who had to
invent *a lot* of new MemCxt infrastructure in order to make some of
Andres' perf optimizations cleaner, just as a semi-related data point.
Anyway, I digress.)

Anyway, such a fix would pessimize all cases, including every single
case that works today (which evidently is almost every single usage of
this feature, since we hadn't heard of this problem until yesterday), in
order to solve a problem that you can only find in very rare ones.
Another point of view is that we should make it work first, then make it
fast. But the point remains that it works fine and fast for 99.99% of
cases.

> > It would be better if we can keep the descriptors open as much as
> > possible, and only close them if there's trouble. I was under the
> > impression that using OpenTransientFile was already taking care of
> > that, but that's evidently not the case.
>
> I don't see how the current API could do that transparently - it does
> track the files, but the user only gets a file descriptor. With just a
> file descriptor, how could the code know to do reopen/seek when it's going
> just through the regular fopen/fclose?

Yeah, I don't know what was in Amit's mind, but it seemed obvious to me
that such a fix required changing that API so that a seekpos is kept
together with the fd. ReorderBufferRestoreChange is static in
reorderbuffer.c so it's not a problem to change its ABI. I agree with
trying to get a reorderbuffer-localized back-patchable fix first, then
see how to improve from that.

> As a sidenote - in the other thread about streaming, one of the patches
> does change how we log subxact assignments. In the end, this allows using
> just a single file for the top-level transaction, instead of having one
> file per subxact. That would also solve this.

:-( While I would love to get that patch done and get rid of the issue,
it's not a backpatchable fix either.

... however, it does mean we maybe can get away with the
reorderbuffer.c-local fix, and then just use your streaming stuff to get
rid of the perf problem forever?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Runtime pruning problem
Alvaro Herrera writes: > So is anyone working on a patch to use this approach? It's on my to-do list, but I'm not sure how soon I'll get to it. regards, tom lane
Re: Runtime pruning problem
On 2019-Jul-30, Tom Lane wrote: > I wrote: > > This may be arguing for a change in ruleutils' existing behavior, > > not sure. But when dealing with traditional-style inheritance, > > I've always thought that Vars above the Append were referring to > > the parent rel in its capacity as the parent, not in its capacity > > as the first child. With new-style partitioning drawing a clear > > distinction between the parent and all its children, it's easier > > to understand the difference. > > OK, so experimenting, I see that it is a change: [...] > The portion of this below the Append is fine, but I argue that > the Vars above the Append should say "part", not "part_p1". > In that way they'd look the same regardless of which partitions > have been pruned or not. So is anyone working on a patch to use this approach? -- Álvaro Herrera    https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Misleading comment in tuplesort_set_bound
Alvaro Herrera writes: > [ shrug ] It seemed to require no further work, so I just pushed Tom's > proposed change. > I added an empty line after the new combined assertion, which makes > clearer (to me anyway) that the other assertions are unrelated. Actually, the thing I wanted to add was some actual comments for those other assertions. But that requires a bit of research that I hadn't made time for... regards, tom lane
Re: [PATCH] ltree, lquery, and ltxtquery binary protocol support
On Wed, Sep 11, 2019 at 9:45 PM Nino Floris wrote: > > * We currently don't add new extension SQL-script for new extension > > version (i.e. ltree--1.2.sql). Instead, we provide just a migration > > script (i.e. ltree--1.1--1.2.sql). This simplifies testing of > > extension migration – plain extension creation implies migration. > > I wasn't sure how to add new methods to the type without doing a full > CREATE TYPE, as ALTER TYPE doesn't allow updates to functions. At that > point wouldn't it be better as a new version? > I checked some other extensions like hstore to find reference material > how to do a new CREATE TYPE, all did a full version bump. > Should I just do a DROP and CREATE instead in a migration? Oh, it appears we didn't add send/recv to any pluggable datatype once we got the extension system. Ideally we need to adjust the ALTER TYPE command so that it could add send/recv functions to an existing type. I see a couple of other options: 1) Write a migration script which directly updates pg_type. 2) Add ltree--1.2.sql and advise users to recreate the extension to get binary protocol support. But both of these options are cumbersome. What do you think about writing a patch for ALTER TYPE? > > * What is motivation for binary format for lquery and ltxtquery? One > could transfer large datasets of ltree's. But is it so for lquery and > ltxtquery, which are used just for querying. > > Completeness, Npgsql (and other drivers like Ecto from Elixir and > probably many others as well) can't do any text fallback in the binary > protocol without manual configuration. > This is because these drivers don't know much (or anything) about the > SQL they send so it wouldn't know to apply it for which columns. > I believe there has been a proposal at some point to enhance the > binary protocol to additionally allow clients to specify text/binary > per data type instead of per column. > That would allow all these drivers to automate this, but I think it > never went anywhere. > As it stands it's not ergonomic to ask developers to setup this > metadata per query they write. > > * Just send binary representation of datatype is not OK. PostgreSQL > supports a lot of platforms with different byte ordering, alignment > and so on. You basically need to send each particular field using > pq_send*() function. > > Oh my, I don't think I did? I copied what jsonb is doing, > specifically, sending the textual representation of the data type with > a version field prefixed. > (It is why I introduced deparse_ltree/lquery, to take the respective > structure and create a null terminated string of its textual form) > So there are no other fields besides version and the string > representation of the structure. > ltree, lquery, and ltxtquery all seem to have a lossless and compact > textual interpretation. > My motivation here has been "if it's good enough for jsonb it should > be good enough for a niche extension like ltree" especially as having > any binary support is better than not having it at all. > I can change it to anything you'd like to see instead but I would need > some pointers as to what that would be. Oh, sorry. I didn't notice you send the textual representation for ltree, lquery, and ltxtquery. And the point for these datatypes is not performance, but completeness. Now I get it, and it looks OK. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
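For reference, a minimal sketch of the jsonb-style send function being discussed: a one-byte version prefix followed by the text form. The deparse_ltree() helper belongs to Nino's patch, so the hard-coded string here merely stands in for its output:

    #include "postgres.h"
    #include "fmgr.h"
    #include "libpq/pqformat.h"

    /* Sketch of an ltree send function: version byte plus the textual
     * form, mirroring the layout jsonb_send() uses. */
    Datum
    ltree_send_sketch(PG_FUNCTION_ARGS)
    {
        StringInfoData buf;
        /* stand-in for deparse_ltree() output from the patch */
        const char *text_form = "Top.Science.Astronomy";
        int         version = 1;

        pq_begintypsend(&buf);
        pq_sendint8(&buf, version);
        pq_sendtext(&buf, text_form, strlen(text_form));
        PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
    }

The version byte is what makes this forward-compatible: a future release could switch to a denser binary layout without breaking old clients, which is exactly the argument made for jsonb.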
Re: Write visibility map during CLUSTER/VACUUM FULL
On Wed, Sep 11, 2019 at 3:30 PM Amit Kapila wrote: > On Sun, Sep 1, 2019 at 1:37 PM Alexander Korotkov > wrote: > > I found it weird that CLUSTER/VACUUM FULL don't write visibility map. > > Attached patch implements writing visibility map in > > heapam_relation_copy_for_cluster(). > > > > I've studied previous attempt to implement this [1]. The main problem > > of that attempt was usage of existing heap_page_is_all_visible() and > > visibilitymap_set() functions. These functions work through the buffer > > manager, while heap rewriting bypasses the buffer manager. > > > > In my patch visibility map pages are handled in the same way as heap > > pages are. > > > > I haven't studied this patch in detail, but while glancing I observed > that this doesn't try to sync the vm pages as we do for heap pages in > the end (during end_heap_rewrite). Am I missing something? You haven't missed anything. Yes, the VM needs a sync. Will fix this. And I just noticed I need a closer look at what is going on with TOAST. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
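A minimal sketch of the missing sync, by analogy with how the main fork is forced to disk when the rewrite skipped WAL; whether this belongs in end_heap_rewrite() or in the cluster callback is an open question here:

    #include "postgres.h"
    #include "common/relpath.h"
    #include "storage/smgr.h"
    #include "utils/rel.h"

    /* Push the visibility-map fork of the rewritten heap to disk, the
     * same way end_heap_rewrite() syncs the main fork when the rewrite
     * bypassed the buffer manager and WAL. */
    static void
    sync_vm_fork(Relation newrel)
    {
        RelationOpenSmgr(newrel);
        smgrimmedsync(newrel->rd_smgr, VISIBILITYMAP_FORKNUM);
    }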
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Thu, Sep 12, 2019 at 5:31 AM Tomas Vondra wrote: > I don't see how the current API could do that transparently - it does > track the files, but the user only gets a file descriptor. With just a > file descriptor, how could the code know to do reopen/seek when it's going > just through the regular fopen/fclose? > > Anyway, I agree we need to do something, to fix this corner case (many > serialized in-progress transactions). ISTM we have two options - either do > something in the context of reorderbuffer.c, or extend the transient file > API somehow. I'd say the second option is the right thing going forward, > because it does allow doing it transparently and without leaking details > about maxAllocatedDescs etc. There are two issues, though - it does > require changes / extensions to the API, and it's not backpatchable. > > So maybe we should start with the localized fix in reorderbuffer, and I > agree tracking offset seems reasonable. We've already got code that knows how to track this sort of thing. You just need to go through the File abstraction (PathNameOpenFile or PathNameOpenFilePerm or OpenTemporaryFile) rather than using the functions that deal directly with fds (OpenTransientFile, BasicOpenFile, etc.). It seems like it would be better to reuse the existing VFD layer than to invent a whole new one specific to logical replication. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
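A minimal sketch of what going through the VFD layer looks like (error handling trimmed); the point is that a File survives fd pressure because fd.c can silently close the kernel descriptor and reopen it, restoring the file position, the next time the File is used:

    #include "postgres.h"
    #include <fcntl.h>
    #include "storage/fd.h"

    /* Open a spill file as a virtual file descriptor.  Unlike
     * OpenTransientFile(), the returned File is not a raw fd, so
     * running low on kernel descriptors does not raise an error:
     * fd.c recycles descriptors behind the caller's back. */
    static File
    open_spill_vfd(const char *path)
    {
        File    vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);

        if (vfd < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not open file \"%s\": %m", path)));
        /* subsequent reads go through FileRead(); see fd.h for the API */
        return vfd;
    }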
Re: Bug in GiST paring heap comparator
On Wed, Sep 11, 2019 at 3:34 AM Nikita Glukhov wrote: > On 09.09.2019 22:47, Alexander Korotkov wrote: > > On Mon, Sep 9, 2019 at 8:32 PM Nikita Glukhov wrote: > > On 08.09.2019 22:32, Alexander Korotkov wrote: > > On Fri, Sep 6, 2019 at 5:44 PM Alexander Korotkov > wrote: > > I'm going to push both if no objections. > > So, pushed! > > Two years ago there was a similar patch for this issue: > https://www.postgresql.org/message-id/1499c9d0-075a-3014-d2aa-ba59121b3728%40postgrespro.ru > > Sorry that I looked at this thread too late. > > I see, thanks. > > Your patch seems a bit less cumbersome thanks to using the GISTDistance > struct. You don't have to introduce a separate array with null flags. > But it's harder to keep index_store_float8_orderby_distances() used > by both GiST and SP-GiST. > > What do you think? You can rebase your patch and propose that as refactoring. > > Rebased patch with refactoring is attached. > > During this rebase I found that SP-GiST code has similar problems with NULLs. > All SP-GiST functions do not check SK_ISNULL flag of ordering ScanKeys, and > that leads to segfaults. In the following test, index is searched with > non-NULL key first and then with NULL key, that leads to a crash: > > CREATE TABLE t AS SELECT point(x, y) p > FROM generate_series(0, 100) x, generate_series(0, 100) y; > > CREATE INDEX ON t USING spgist (p); > > -- crash > SELECT (SELECT p FROM t ORDER BY p <-> pt LIMIT 1) > FROM (VALUES (point '1,2'), (NULL)) pts(pt); > > > After adding SK_ISNULL checks and starting to produce NULL distances, we need > to > store NULL flags for distance somewhere. Existing singleton flag for the > whole > SPGistSearchItem is not sufficient anymore. > > > So, I introduced structure IndexOrderByDistance and used it everywhere in the > GiST and SP-GiST code instead of raw double distances. > > > SP-GiST order-by-distance code can be further refactored so that user > functions > do not have to worry about SK_ISNULL checks. It doesn't seem SP-GiST inner_consistent and leaf_consistent functions can handle NULL scan keys for now. So should we let them handle NULL orderby keys? If we assume that NULL arguments produce NULL distances, we can avoid changing the code of opclasses. Also, I've noticed IndexOrderByDistance is missing from typedefs.list. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
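A minimal sketch of NULL-aware distance comparison built on the IndexOrderByDistance struct from the patch (assumed to carry a double value plus an isnull flag), treating NULL distances as sorting after every non-NULL one, as NULLS LAST ordering requires:

    #include "postgres.h"
    #include "access/genam.h"   /* IndexOrderByDistance, per the patch */

    /* Compare two distances with NULLs last; returns <0, 0 or >0. */
    static int
    index_orderby_distance_cmp(const IndexOrderByDistance *a,
                               const IndexOrderByDistance *b)
    {
        if (a->isnull)
            return b->isnull ? 0 : 1;
        if (b->isnull)
            return -1;
        if (a->value != b->value)
            return (a->value < b->value) ? -1 : 1;
        return 0;
    }

With a comparator like this in the pairing heap, opclass consistent functions never need to see a NULL ordering key at all: the core code can short-circuit NULL arguments into NULL distances.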
Re: Misleading comment in tuplesort_set_bound
On 2019-Sep-05, James Coleman wrote: > Yes, planning on it, just a bit behind right now so will likely be a > few more days at least. [ shrug ] It seemed to require no further work, so I just pushed Tom's proposed change. I added an empty line after the new combined assertion, which makes clearer (to me anyway) that the other assertions are unrelated. Thanks -- Álvaro Herrera    https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: block-level incremental backup
Hi, One of my colleagues at EDB, Rajkumar Raghuwanshi, reported an issue while testing this feature: if a full base backup is taken, then a database is created, and then an incremental backup is taken, combining the full backup with the incremental backup fails. I had a look at this issue and observed that when the new database is created, the catalog files are copied as-is into the directory corresponding to the newly created database. And as they are just copied, the LSNs on those pages are not changed. Because of this, the incremental backup logic treats each of them as an existing, unchanged file and does not copy the blocks from these new files, leading to the failure. I was surprised to learn that even though we are creating new files from old files, we keep the LSNs unmodified. I didn't see anything else in basebackup which indicates that this is a new file since the last LSN or something like that. I tried looking for any other DDL doing similar stuff, like creating a new page with an existing LSN, but I could not find any commands other than CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE. Suggestions/thoughts? -- Jeevan Chalke Technical Architect, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company
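A minimal sketch of the block filter in question, to make the failure mode concrete; prev_lsn stands for the start LSN of the previous backup and the function name is illustrative:

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/bufpage.h"

    /* Decide whether a block must be included in the incremental
     * backup.  Pages that CREATE DATABASE copied verbatim keep their
     * old LSN, so this test wrongly classifies them as unchanged. */
    static bool
    block_changed_since(Page page, XLogRecPtr prev_lsn)
    {
        return PageGetLSN(page) > prev_lsn;
    }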
Re: SegFault on 9.6.14
On Thu, Sep 12, 2019 at 8:55 AM Amit Kapila wrote: > Robert, Thomas, do you have any more suggestions related to this? I > am planning to commit the above-discussed patch (Forbid Limit node to > shutdown resources.) this coming Monday, so that at least the reported > problem gets fixed. I think that your commit message isn't very clear about what the actual issue is. And the patch itself doesn't add any comments or anything to try to clear it up. So I wouldn't favor committing it in this form. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Close stdout and stderr in syslogger
Sviatoslav Ermilin writes: > Hi! A few months ago we encountered a situation where some quite big log > files were kept open by Postgres despite being deleted. This affects free > space calculation in our managed PostgreSQL instances. Currently I'm > investigating this issue. We traced some roots to unclosed descriptors in > the Perl code of postgresql-common [0], but we still face some unclosed > descriptors pointing to the log file. Debian tools from postgresql-common > start pg_ctl piped to a logfile. The descriptor is piped to the logfile and > blocks it for deletion. That is why we can't delete logfile.1 after > logrotate. And after a second logrotate, logfile.1 is in "deleted" status, > but can't be deleted in fact. Here I attach a patch with a possible > solution. In this patch stdout and stderr pipes are just closed in the > syslogger. -- Sviatoslav Ermilin, Yandex [0] > https://salsa.debian.org/postgresql/postgresql-common/commit/580aa0677edc222ebaf6e1031cf3929f847f27fb I'm quite certain that the current behavior is intentional, if only because closing the syslogger's stderr would make it impossible to debug problems inside the syslogger. Why is it a problem that we leave it open? I don't believe either that the file will grow much (in normal cases anyway), or that it's impossible to unlink it (except on Windows, but you didn't say anything about Windows). In any case, the proposed patch almost certainly introduces new problems, in that you dropped the fclose's into code that executes repeatedly. regards, tom lane
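For what it's worth, a minimal sketch of guarding the close so it at least cannot run twice; whether closing these streams in the syslogger is desirable at all is the real question raised above:

    #include <stdio.h>
    #include <stdbool.h>

    /* Close stdout/stderr exactly once; calling fclose() on an
     * already-closed stream is undefined behavior, which is the
     * hazard of putting the calls into logfile_rotate(). */
    static void
    close_stdio_streams_once(void)
    {
        static bool already_closed = false;

        if (already_closed)
            return;
        fclose(stdout);
        fclose(stderr);
        already_closed = true;
    }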
Re: doc: pg_trgm missing description for GUC "pg_trgm.strict_word_similarity_threshold"
On Thu, Sep 12, 2019 at 3:39 PM Euler Taveira wrote: > On Mon, Jun 10, 2019 at 14:34, Alexander Korotkov > wrote: > > > > Pushed! > > > Alexander, this commit is OK for 11 and later. However, GUC > strict_word_similarity_threshold does not exist in 9.6 and 10. The > attached patch reverts this part. It should apply cleanly in 9.6 and > 10. Thank you for pointing this out. Pushed. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Amcheck: do rightlink verification with lock coupling
On 2019-Sep-12, Andrey Borodin wrote: > This patch violates one of amcheck's design principles: the current code > never takes more than one page lock. I do not know: should we keep this > rule, or should we do the deeper check? The check does seem worthwhile to me. As far as I know, in btree you can lock the right sibling of a page while keeping a lock on the page itself, without risk of deadlock. So I'm not sure what's the issue with that. This is in the README: In most cases we release our lock and pin on a page before attempting to acquire pin and lock on the page we are moving to. In a few places it is necessary to lock the next page before releasing the current one. This is safe when moving right or up, but not when moving left or down (else we'd create the possibility of deadlocks). I suppose Peter was concerned about being able to run amcheck without causing any trouble at all for concurrent operation; maybe we can retain that property by making this check optional. -- Álvaro Herrera    https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
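A minimal sketch of the coupled check (simplified; a real patch must also cope with half-dead and deleted pages): keep the read lock on the current page while locking its right sibling, then compare the links.

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    /* Assumes a read lock is already held on "buf"; per the nbtree
     * README, coupling locks while moving right cannot deadlock. */
    static void
    check_rightlink_coupled(Relation rel, Buffer buf)
    {
        Page        page = BufferGetPage(buf);
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        if (opaque->btpo_next != P_NONE)
        {
            Buffer      rbuf = ReadBuffer(rel, opaque->btpo_next);
            BTPageOpaque ropaque;

            LockBuffer(rbuf, BT_READ);
            ropaque = (BTPageOpaque)
                PageGetSpecialPointer(BufferGetPage(rbuf));
            if (ropaque->btpo_prev != BufferGetBlockNumber(buf))
                ereport(WARNING,
                        (errmsg("left link/right link pair not in agreement")));
            UnlockReleaseBuffer(rbuf);
        }
    }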
Amcheck: do rightlink verification with lock coupling
Hi! This is a thread to discuss an amcheck feature started in another thread [0]. Currently amcheck scans every B-tree level. If verification is done with ShareLock, amcheck tests that the leftlink of each page and the rightlink of its left sibling point at each other. This is an important invariant; in our experience it has proved good at detecting various corruptions. But the invariant can be checked only if neither page is being modified concurrently (e.g. by a page split). PFA a patch that, in case of suspicion, locks the two pages together and retests the invariant. This allows detection of several kinds of corruption on a standby. This patch violates one of amcheck's design principles: the current code never takes more than one page lock. I do not know: should we keep this rule, or should we do the deeper check? Thanks! Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/flat/da9b33ac-53cb-4643-96d4-7a0bbc037...@yandex-team.ru v2-0001-In-amcheck-nbtree-do-rightlink-verification-with-loc.patch Description: Binary data
Re: tableam vs. TOAST
On Fri, Aug 2, 2019 at 6:42 PM Andres Freund wrote: > The obvious problem with this is the slice fetching logic. For slices > with an offset of 0, it's obviously trivial to implement. For the higher > slice logic, I'd assume we'd have to fetch the first slice by estimating > where the start chunk is based on the currently suggested chunk size, and > restarting the scan earlier/later if not accurate. In most cases it'll > be accurate, so we'd not lose efficiency. Dilip and I were discussing this proposal this morning, and after talking to him, I don't quite see how to make this work without significant loss of efficiency. Suppose that, based on the estimated chunk size, you decide that there are probably 10 chunks and that the value that you need is probably located in chunk 6. So you fetch chunk 6. Happily, chunk 6 is the size that you expect, so you extract the bytes that you need and go on your way. But ... what if there are actually 6 chunks, not 10, and the first five are bigger than you expected, and the reason the size of chunk 6 matched your expectation is that it was the last chunk and thus smaller than the rest? Now you've just returned the wrong answer. AFAICS, there's no way to detect that except to always read at least two chunks, which seems like a pretty heavy price to pay. It doesn't cost anything if you were going to need to read at least 2 chunks anyway, but if you were only going to need to read 1, having to fetch 2 stinks. Actually, when I initially read your proposal, I thought you were proposing to relax things such that the chunks didn't even have to all be the same size. That seems like it would be something cool that could potentially be leveraged not only by AMs but perhaps also by data types that want to break up their data strategically to cater to future access patterns. But then there's really no way to make slicing work without always reading from the beginning. There would be no big problem here - in either scenario - if each chunk contained the byte-offset of that chunk relative to the start of the datum. You could guess wildly and always know whether or not you had got it right without reading any extra data. But that's not the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
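A minimal sketch of that last idea, with a hypothetical chunk_start field (today's TOAST chunks store only a sequence number; the field is shown purely to illustrate the point): a slice fetch could start at a guessed chunk and confirm or refute the guess from that one chunk alone.

    #include "postgres.h"

    /* Hypothetical chunk header; chunk_start does not exist in the
     * real TOAST format and is an assumption for illustration. */
    typedef struct ToastChunkHeaderSketch
    {
        int32   chunk_seq;      /* existing: ordinal chunk number */
        int32   chunk_start;    /* hypothetical: byte offset in datum */
    } ToastChunkHeaderSketch;

    /* With a stored offset, a single fetched chunk is enough to verify
     * a guess, no matter how chunk sizes vary across the datum. */
    static bool
    chunk_covers_offset(const ToastChunkHeaderSketch *hdr,
                        int32 wanted_offset, int32 chunk_data_len)
    {
        return wanted_offset >= hdr->chunk_start &&
            wanted_offset < hdr->chunk_start + chunk_data_len;
    }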
Re: SegFault on 9.6.14
On Thu, Sep 5, 2019 at 7:53 PM Amit Kapila wrote: > > On Mon, Sep 2, 2019 at 4:51 PM Amit Kapila wrote: > > > > On Fri, Aug 9, 2019 at 6:29 PM Robert Haas wrote: > > > > > > > > > But beyond that, the issue here is that the Limit node is shutting > > > down the Gather node too early, and the right fix must be to stop > > > doing that, not to change the definition of what it means to shut down > > > a node, as this patch does. So maybe a possible approach here - which > > > I think is more or less what Tom is proposing - is: > > > > > > 1. Remove the code from ExecLimit() that calls ExecShutdownNode(). > > > > > > > Attached patch does that. I have also added one test as a separate > > patch so that later if we introduce shutting down resources in Limit > > node, we don't break anything. As of now, I have kept it separate for > > easy verification, but if we decide to go with this approach and test > > appears fine, we can merge it along with the fix. > > > > I have merged the code change and test case patch as I felt that it is > good to cover this case. I have slightly changed the test case to > make its output predictable (made the inner scan ordered so that the > query always produces the same result). One more thing: I am not able > to come up with a predictable test case for the 9.6 branch, as it > doesn't support Gather Merge, which is required for this particular > test to always produce predictable output. There could be some better > way to write this test, so any input in that regard or otherwise is > welcome. So, if we commit this patch the containing test case will be > for branches HEAD~10, but the code will be for HEAD~9.6. > Robert, Thomas, do you have any more suggestions related to this? I am planning to commit the above-discussed patch (Forbid Limit node to shutdown resources.) this coming Monday, so that at least the reported problem gets fixed. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: doc: pg_trgm missing description for GUC "pg_trgm.strict_word_similarity_threshold"
On Mon, Jun 10, 2019 at 14:34, Alexander Korotkov wrote: > > Pushed! > Alexander, this commit is OK for 11 and later. However, GUC strict_word_similarity_threshold does not exist in 9.6 and 10. The attached patch reverts this part. It should apply cleanly in 9.6 and 10. -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consulting, Development, 24x7 Support and Training From 26f77930fa668c329694da43697361877dfb5ac0 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Thu, 12 Sep 2019 09:12:59 -0300 Subject: [PATCH] Fix pg_trgm documentation Revert part of the commit 9ee98cc3f because it introduced a GUC description that does not exist on this version. --- doc/src/sgml/pgtrgm.sgml | 17 - 1 file changed, 17 deletions(-) diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml index 9f9482ef27..b8e0df6c03 100644 --- a/doc/src/sgml/pgtrgm.sgml +++ b/doc/src/sgml/pgtrgm.sgml @@ -263,23 +263,6 @@ - - - pg_trgm.strict_word_similarity_threshold (real) - - - pg_trgm.strict_word_similarity_threshold configuration parameter - - - - - - Sets the current strict word similarity threshold that is used by the - % and % operators. The threshold - must be between 0 and 1 (default is 0.5). - - - -- 2.11.0
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Thu, Sep 12, 2019 at 4:48 PM Michael Paquier wrote: > > On Thu, Sep 12, 2019 at 05:34:08PM +0800, Masahiko Sawada wrote: > > On Thu, Sep 12, 2019 at 1:56 PM Amit Kapila wrote: > >> I think it is better to use a message like "must be superuser to use > >> pageinspect functions" as this function doesn't take raw page as > >> input. If you see other functions like bt_page_items which doesn't > >> take raw page as input has the message which I am suggesting. I can > >> see that tuple_data_split also has a similar message as you are > >> proposing, but I think that is also not very appropriate. > > > > Agreed. Will fix. > > Well, those functions are all able to work only from data of a raw > page, so the existing message was actually fine by me. If you want to > change it this way, I don't see either any arguments against. > > > Hmm my understanding of 'decode_combined' is to decode the flags that > > we represent by using multiple flags. HEAP_XMAX_IS_LOCKED_ONLY is true > > either if HEAP_XMAX_LOCK_ONLY is set or not a multi and the EXCL_LOCK > > bit is set. That is it requires only one flag. So I thought that it's > > not a combined flag. > > Same interpretation here. > Hmm, I thought that when the decode_combined flag is set to false, we would display the raw flags set on the tuple without any further interpretation. I think that is what most people in the thread advocated. > >> I am not completely sure whether we want to cover HEAP_LOCKED_UPGRADED > >> case even when decode_combined flag is set. It seems this is a bit > >> more interpretation of flags than we are doing in other cases. For > >> example, other cases like HEAP_XMAX_SHR_LOCK or HEAP_XMIN_FROZEN are > >> the flags that are explicitly set on the tuple so displaying them > >> makes sense, but the same is not true for HEAP_LOCKED_UPGRADED. > > > > I thought it would be better to interpret it as much as possible, > > especially for diagnostic use cases. I'm concerned that user might not > > be able to get enough information for investigation if we > > intentionally filtered particular flags. > > For HEAP_LOCKED_UPGRADED, my interpretation was that the current code > is correct to understand it as a decomposition of HEAP_XMAX_IS_MULTI > and HEAP_XMAX_LOCK_ONLY, still... > > It seems to me that the way we define combined flags is subject to > a lot of interpretation. > Right. > Honestly, if we cannot come up with a clear > definition of what should be combined or not, I would be of the > opinion to just wipe out the option, and just return in the text array > the bits which are set. It has been discussed on the thread that it > would be confusing to not show combined flags to some users as some > bits set have rather contrary meaning when set with others. > Yes, I think we could have more discussion on this point. It is not 100% clear how we should interpret this flag or where to draw the line. It might be that whatever we have done is alright, but still, it is worth more discussion and opinion from a few more people. > We tell > the user that all the flag details are defined in htup_details.h in > the code and the documentation so the information is not in the > returned data, but in the code. And I would like to think that users > of pageinspect are knowledgeable enough about Postgres that they would > likely never use decode_combined = true.
Likely I am outnumbered > regarding this point, so I won't push hard on it, still I get that the > confusion does not come from this module, but in the way the code > combines and names all the bits for the infomasks :) > > And there would be the argument to not use HEAP_XMAX_IS_LOCKED_ONLY() > in the code. > > >> I am not very happy with the parameter name 'decode_combined'. It is > >> not clear from the name what it means and I think it doesn't even > >> match with what we are actually doing here. How about raw_flags, > >> raw_tuple_flags or something like that? > >> > > > > raw_flags might be more straightforward. Or perhaps the third idea > > could be show_raw_flags? If other hackers agree to change the flag > > name I'll fix it. > > > > I'll submit the patch to fix the commit after we got a consensus on > > the above changes. > > decode_combined sounds like a good compromise to me. If there is a > better consensus, well, let's use it, but I don't find those > suggestions to be improvements. > I think it depends on the meaning of that flag. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Thu, Sep 12, 2019 at 2:05 PM Kyotaro Horiguchi wrote: > > Hello. > > At Wed, 11 Sep 2019 17:26:44 +0300, Anastasia Lubennikova < a.lubennik...@postgrespro.ru> wrote in < a82a896b-93f0-c26c-b941-f56651313...@postgrespro.ru> > > 10.09.2019 14:42, Asim R P wrote: > > > Hi Anastasia > > > > On Thu, Aug 22, 2019 at 9:43 PM Anastasia Lubennikova > > > mailto:a.lubennik...@postgrespro.ru>> > > > wrote: > > > > > > But during the review, I found a bug in the current implementation. > > > > New behavior must apply to crash-recovery only, now it applies to > > > > archiveRecovery too. > > > > That can cause a silent loss of a tablespace during regular standby > > > > operation > > > > since it never calls CheckRecoveryConsistency(). > > Yeah. We should take the same steps with redo operations on > missing pages. Just record failure during inconsistent state then > forget it if underlying tablespace is gone. If we had a record > when we reached consistency, we're in a serious situation and > should stop recovery. log_invalid_page forget_invalid_pages and > CheckRecoveryConsistency are the entry points of the feature to > understand. > Yes, I get it now. I will adjust the patch written by Paul accordingly. > > # I remember that the start point of this patch is a crash after > # table space drop subsequent to several operations within the > # table space. Then, crash recovery fails at an operation in the > # finally-removed tablespace. Is it right? > That's correct. Once the directories are removed from the filesystem, any attempt to replay WAL records that depend on their existence fails. > > > But before that I'm revisiting another solution upthread, that of > > > creating restart points when replaying create/drop database commands > > > before making filesystem changes such as removing a directory. The > > > restart points should align with checkpoints on master. The concern > > > against this solution was creation of restart points will slow down > > > recovery. I don't think crash recovery is affected by this solution > > > because of the already existing enforcement of checkpoints. WAL > > > records prior to a create/drop database will not be seen by crash > > > recovery due to the checkpoint enforced during the command's normal > > > execution. > > > > Generally we shouldn't trigger useless restart points for > uncertain reasons. If standby crashes, it starts the next > recovery from the latest *restart point*. Even in that case what > we should do is the same. > The reason is quite clear to me - removing directories from the filesystem breaks the ability to replay WAL records a second time. And we already create checkpoints during normal operation in such a case, so crash recovery on a master node does not suffer from this bug. I've attached a patch that creates restart points during drop database replay, just for reference. It passes both the TAP tests written by Kyotaro and Paul. I had to modify the drop database WAL record a bit.
Asim v1-0001-Create-restartpoint-when-replaying-drop-database.patch Description: Binary data v1-0002-Test-to-validate-replay-of-WAL-records-involving-.patch Description: Binary data
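Presumably the attached patch does something along these lines; a minimal sketch, assuming it is called from the DROP DATABASE redo routine just before the directories are removed (the flag choice is illustrative):

    #include "postgres.h"
    #include "access/xlog.h"
    #include "postmaster/bgwriter.h"

    /* During recovery, RequestCheckpoint() asks the checkpointer for a
     * restartpoint; establishing one before unlinking the database
     * directories means a later crash cannot force us to replay
     * records that depend on their existence. */
    static void
    restartpoint_before_dbase_drop(void)
    {
        RequestCheckpoint(CHECKPOINT_IMMEDIATE |
                          CHECKPOINT_FORCE |
                          CHECKPOINT_WAIT);
    }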
Re: Create collation reporting the ICU locale display name
Michael Paquier wrote: > On Wed, Sep 11, 2019 at 04:53:16PM +0200, Daniel Verite wrote: > > I think it would be nice to have CREATE COLLATION report this > > information as feedback in the form of a NOTICE message. > > PFA a simple patch implementing that. > > Why is that better than the descriptions provided with \dO[S]+ in > psql? There is no description for collations created outside of pg_import_system_collations(). Example: db=# create collation mycoll(provider=icu, locale='fr-FR-u-ks-level1'); NOTICE: ICU locale: "French (France, colstrength=primary)" db=# \x auto db=# \dO+ List of collations -[ RECORD 1 ]--+-- Schema | public Name | mycoll Collate| fr-FR-u-ks-level1 Ctype | fr-FR-u-ks-level1 Provider | icu Deterministic? | yes Description| The NOTICE above is with the patch. Otherwise, the "display name" is never shown nor stored anywhere AFAICS. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
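A minimal sketch of where the display name comes from on the C side (error handling and the UTF-16-to-server-encoding conversion left out); if memory serves, uloc_getDisplayName() is the same ICU call pg_import_system_collations() uses when it builds collation comments:

    #include "postgres.h"
    #include <unicode/uloc.h>

    /* Fetch ICU's display name for a locale into a UTF-16 buffer. */
    static void
    report_icu_display_name(const char *localename)
    {
        UChar       displayname[128];
        UErrorCode  status = U_ZERO_ERROR;

        uloc_getDisplayName(localename, "en", displayname,
                            lengthof(displayname), &status);
        if (U_FAILURE(status))
            return;             /* no display name; skip the NOTICE */
        /* ... convert to the server encoding and ereport(NOTICE, ...) ... */
    }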
Re: psql - improve test coverage from 41% to 88%
Here is a v5. Few more in icommand_checks subroutine: Few unwanted code can be removed. Indeed, more debug and test code. Attached v6 fixes these, and I checked for remaining scrubs without finding any. -- Fabien.diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl index b7d36b65dd..aabc27ab6f 100644 --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl @@ -506,6 +506,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt" ], 1, + '', [qr{^$}], [qr/^WARNING.*checksum verification failed/s], 'pg_basebackup reports checksum mismatch'); @@ -526,6 +527,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt2" ], 1, + '', [qr{^$}], [qr/^WARNING.*further.*failures.*will.not.be.reported/s], 'pg_basebackup does not report more than 5 checksum mismatches'); @@ -542,6 +544,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt3" ], 1, + '', [qr{^$}], [qr/^WARNING.*7 total checksum verification failures/s], 'pg_basebackup correctly report the total number of checksum mismatches'); diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl index 59228b916c..cf4811d382 100644 --- a/src/bin/pg_checksums/t/002_actions.pl +++ b/src/bin/pg_checksums/t/002_actions.pl @@ -62,6 +62,7 @@ sub check_relation_corruption '--filenode', $relfilenode_corrupted ], 1, + '', [qr/Bad checksums:.*1/], [qr/checksum verification failed/], "fails with corrupted data for single relfilenode on tablespace $tablespace" @@ -71,6 +72,7 @@ sub check_relation_corruption $node->command_checks_all( [ 'pg_checksums', '--check', '-D', $pgdata ], 1, + '', [qr/Bad checksums:.*1/], [qr/checksum verification failed/], "fails with corrupted data on tablespace $tablespace"); @@ -203,6 +205,7 @@ sub fail_corrupt $node->command_checks_all( [ 'pg_checksums', '--check', '-D', $pgdata ], 1, + '', [qr/^$/], [qr/could not read block 0 in file.*$file\":/], "fails for corrupted data in $file"); diff --git a/src/bin/pg_controldata/t/001_pg_controldata.pl b/src/bin/pg_controldata/t/001_pg_controldata.pl index 3b63ad230f..ebe9b80a52 100644 --- a/src/bin/pg_controldata/t/001_pg_controldata.pl +++ b/src/bin/pg_controldata/t/001_pg_controldata.pl @@ -33,6 +33,7 @@ close $fh; command_checks_all( [ 'pg_controldata', $node->data_dir ], 0, + '', [ qr/WARNING: Calculated CRC checksum does not match value stored in file/, qr/WARNING: invalid WAL segment size/ diff --git a/src/bin/pg_resetwal/t/002_corrupted.pl b/src/bin/pg_resetwal/t/002_corrupted.pl index f9940d7fc5..1990669d26 100644 --- a/src/bin/pg_resetwal/t/002_corrupted.pl +++ b/src/bin/pg_resetwal/t/002_corrupted.pl @@ -30,6 +30,7 @@ close $fh; command_checks_all( [ 'pg_resetwal', '-n', $node->data_dir ], 0, + '', [qr/pg_control version number/], [ qr/pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it/ @@ -46,6 +47,7 @@ close $fh; command_checks_all( [ 'pg_resetwal', '-n', $node->data_dir ], 0, + '', [qr/pg_control version number/], [ qr/\Qpg_resetwal: warning: pg_control specifies invalid WAL segment size (0 bytes); proceed with caution\E/ diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl index b82d3f65c4..01010914fe 100644 --- 
a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -50,7 +50,7 @@ sub pgbench push @cmd, @args; - $node->command_checks_all(\@cmd, $stat, $out, $err, $name); + $node->command_checks_all(\@cmd, $stat, '', $out, $err, $name); # cleanup? #unlink @filenames or die "cannot unlink files (@filenames): $!"; diff --git a/src/bin/pgbench/t/002_pgbench_no_server.pl b/src/bin/pgbench/t/002_pgbench_no_server.pl index f7fa18418b..b58f3548c3 100644 --- a/src/bin/pgbench/t/002_pgbench_no_server.pl +++ b/src/bin/pgbench/t/002_pgbench_no_server.pl @@ -25,7 +25,7 @@ sub pgbench my ($opts, $stat, $out, $err, $name) = @_; print STDERR "opts=$opts, stat=$stat, out=$out, err=$err, name=$name"; command_checks_all([ 'pgbench', split(/\s+/, $opts) ], - $stat, $out, $err, $name); + $stat, '', $out, $err, $name); return; } @@ -52,7 +52,7 @@ sub pgbench_scripts push @cmd, '-f', $filename; } } - command_checks_all(\@cmd, $stat, $out, $err, $name); + command_checks_all(\@cmd, $stat, '', $out, $err, $name); return; } diff --git a/src/bin/psql/.gitignore b/src/bin/psql/.gitignore index c2862b12d6..d324c1c1fa 100644 --- a/src/bin/psql/.gitignore +++ b/src/bin/psql/.gitignore @@ -3,3 +3,4 @@ /sql_help.c /psql +/tmp_check diff --git a/src/bin/psql/Makefile
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Thu, Sep 12, 2019 at 05:34:08PM +0800, Masahiko Sawada wrote: > On Thu, Sep 12, 2019 at 1:56 PM Amit Kapila wrote: >> I think it is better to use a message like "must be superuser to use >> pageinspect functions" as this function doesn't take raw page as >> input. If you see other functions like bt_page_items which doesn't >> take raw page as input has the message which I am suggesting. I can >> see that tuple_data_split also has a similar message as you are >> proposing, but I think that is also not very appropriate. > > Agreed. Will fix. Well, those functions are all able to work only from data of a raw page, so the existing message was actually fine by me. If you want to change it this way, I don't see either any arguments against. > Hmm my understanding of 'decode_combined' is to decode the flags that > we represent by using multiple flags. HEAP_XMAX_IS_LOCKED_ONLY is true > either if HEAP_XMAX_LOCK_ONLY is set or not a multi and the EXCL_LOCK > bit is set. That is it requires only one flag. So I thought that it's > not a combined flag. Same interpretation here. >> I am not completely sure whether we want to cover HEAP_LOCKED_UPGRADED >> case even when decode_combined flag is set. It seems this is a bit >> more interpretation of flags than we are doing in other cases. For >> example, other cases like HEAP_XMAX_SHR_LOCK or HEAP_XMIN_FROZEN are >> the flags that are explicitly set on the tuple so displaying them >> makes sense, but the same is not true for HEAP_LOCKED_UPGRADED. > > I thought it would be better to interpret it as much as possible, > especially for diagnostic use cases. I'm concerned that user might not > be able to get enough information for investigation if we > intentionally filtered particular flags. For HEAP_LOCKED_UPGRADED, my interpretation was that the current code is correct to understand it as a decomposition of HEAP_XMAX_IS_MULTI and HEAP_XMAX_LOCK_ONLY, still... It seems to me that the way we define combined flags is subject to a lot of interpretation. Honestly, if we cannot come up with a clear definition of what should be combined or not, I would be of the opinion to just wipe out the option, and just return in the text array the bits which are set. It has been discussed on the thread that it would be confusing to not show combined flags to some users as some bits set have rather contrary meaning when set with others. We tell the user that all the flag details are defined in htup_details.h in the code and the documentation so the information is not in the returned data, but in the code. And I would like to think that users of pageinspect are knowledgeable enough about Postgres that they would likely never use decode_combined = true. Likely I am outnumbered regarding this point, so I won't push hard on it, still I get that the confusion does not come from this module, but in the way the code combines and names all the bits for the infomasks :) And there would be the argument to not use HEAP_XMAX_IS_LOCKED_ONLY() in the code. >> I am not very happy with the parameter name 'decode_combined'. It is >> not clear from the name what it means and I think it doesn't even >> match with what we are actually doing here. How about raw_flags, >> raw_tuple_flags or something like that? >> > > raw_flags might be more straightforward. Or perhaps the third idea > could be show_raw_flags? If other hackers agree to change the flag > name I'll fix it. > > I'll submit the patch to fix the commit after we got a consensus on > the above changes. 
decode_combined sounds like a good compromise to me. If there is a better consensus, well, let's use it, but I don't find those suggestions to be improvements. -- Michael signature.asc Description: PGP signature
Close stdout and stderr in syslogger
Hi! A few months ago we encountered a situation where some quite big log files were kept open by Postgres despite being deleted. This affects free space calculation in our managed PostgreSQL instances. Currently I'm investigating this issue. We traced some roots to unclosed descriptors in the Perl code of postgresql-common [0], but we still face some unclosed descriptors pointing to the log file. Debian tools from postgresql-common start pg_ctl piped to a logfile. The descriptor is piped to the logfile and blocks it for deletion. That is why we can't delete logfile.1 after logrotate. And after a second logrotate, logfile.1 is in "deleted" status, but can't be deleted in fact. Here I attach a patch with a possible solution. In this patch stdout and stderr pipes are just closed in the syslogger. -- Sviatoslav Ermilin, Yandex [0] https://salsa.debian.org/postgresql/postgresql-common/commit/580aa0677edc222ebaf6e1031cf3929f847f27fb From ef44d999ce1d7dcfa515713e52d8b318d5f655af Mon Sep 17 00:00:00 2001 From: Ermilin Sviatoslav Date: Thu, 12 Sep 2019 11:58:05 +0500 Subject: [PATCH] Close stderr and stdout in syslogger. Debian tools from postgresql-common starts pg_ctl piped to logfile. Descriptor is piped to logfile and block it for delete. That is why we can't delete logfile.1 after logrotate. And after second logrotate logfile.1 is in "deleted" status, but can't be deleted in fact. --- src/backend/postmaster/syslogger.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index 14d72d38d7..e23fa96c71 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -1365,6 +1365,9 @@ logfile_rotate(bool time_based_rotation, int size_rotation_for) if (csvfilename) pfree(csvfilename); + fclose(stdout); + fclose(stderr); + update_metainfo_datafile(); set_next_rotation_time(); -- 2.20.1 (Apple Git-117)
Re: psql - improve test coverage from 41% to 88%
On Thu, Sep 12, 2019 at 2:15 PM Fabien COELHO wrote: > > > >> Ok. Rebased version added, with some minor changes to improve readability > >> (comments, variables). > > > > Few comments: [...] > > > > Commented line can be removed > > Commented lines can be removed > > ??? can be changed to some suitable heading > > tab-complation to be changed to tab-completion > > Commented lines can be removed, some more are present below these lines > > also. > > Thanks for this review. > > The lines were really tests I did that had some issues because of the way > the Expect module works, and are not useful for inclusion in the code > base. > > Here is a v5. Few more in icommand_checks subroutine: + + #$ps->slave->stty(qw(raw -echo)); + $ps->slave->stty(qw(raw)); + my $n = 0; + for my $test (@$inout) + { + #warn "test: @$test"; + my ($in, @out) = @$test; + $n++; + #warn "in: $in"; + #warn "out: @out"; + $ps->send($in); Few unwanted code can be removed. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: let's kill AtSubStart_Notify
At Thu, 12 Sep 2019 09:44:49 +0530, Dilip Kumar wrote in > On Wed, Sep 11, 2019 at 6:22 PM Robert Haas wrote: > > trivial subtransactions. I used this test case: > > > > \timing > > do $$begin for i in 1 .. 1000 loop begin null; exception when > > others then null; end; end loop; end;$$; > > > > I ran the test four times with and without the patch and took the > > median of the last three. This was an attempt to exclude effects due > > to starting up the database cluster. With the patch, the result was > > 3127.377 ms; without the patch, it was 3527.285 ms. That's a big > > enough difference that I'm wondering whether I did something wrong > > while testing this, so feel free to check my work and tell me whether > > I'm all wet. Still, I don't find it wholly unbelievable, because I've > > observed in the past that these code paths are lean enough that a few > > palloc() calls can make a noticeable difference, and the effect of > > this patch is to remove a few palloc() calls. > > I did not read the patch but ran the same case you gave, and I can see a > similar improvement with the patch. > With the patch 8832.988 ms, without the patch 10252.701 ms (median of three > readings) I see a similar result; the patch lets it run faster by about 25%. The gain is reduced to 3-6% by a crude check: adding { (in TopTransactionContext) lcons(0, p1); lcons(0, p2); } at the place where AtSubStart_Notify was called, and the corresponding list_delete_first's just after the call to AtSubCommit_Notify. At least around 20% of the gain seems to be the result of removing the palloc/pfree's. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
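For clarity, a rough sketch of that probe (variable names assumed): re-add a pair of list allocations in TopTransactionContext at the old subtransaction-start point and drop them again at the subcommit point, to isolate how much of the win comes from avoiding palloc/pfree.

    #include "postgres.h"
    #include "nodes/pg_list.h"
    #include "utils/memutils.h"

    /* Simulate the removed subxact-start allocations: cons two cells
     * in TopTransactionContext, then drop them again at "subcommit". */
    static void
    probe_subxact_lists(List **p1, List **p2)
    {
        MemoryContext oldcxt = MemoryContextSwitchTo(TopTransactionContext);

        *p1 = lcons(NULL, *p1);
        *p2 = lcons(NULL, *p2);
        MemoryContextSwitchTo(oldcxt);

        /* ... later, at the point matching AtSubCommit_Notify() ... */
        *p1 = list_delete_first(*p1);
        *p2 = list_delete_first(*p2);
    }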
Re: Specifying attribute slot for storing/reading statistics
> > So these are 4 different data types (or classes of data types) that you > introduce in your extension? Or is that just a conceptual view and it's > stored in some other way (e.g. normalized in some way)? > At the SQL level these 4 durations are not distinguishable. For example, for a tfloat (temporal float) we can have

select tfloat '1@2000-01-01' -- Instant duration
select tfloat '{1@2000-01-01, 2@2000-01-02, 1@2000-01-03}' -- Instant set duration
select tfloat '[1@2000-01-01, 2@2000-01-02, 1@2000-01-03)' -- Sequence duration, left-inclusive and right-exclusive bounds
select tfloat '{[1@2000-01-01, 2@2000-01-02, 1@2000-01-03], [1@2000-01-04, 1@2000-01-05]}' -- Sequence set duration

Nevertheless it is possible to restrict a column to a specific duration with a typmod specifier, as in

create table test (..., measure tfloat(Instant), ...) -- only Instant durations accepted

At the C level these 4 durations are distinguished and implemented in something equivalent to a template abstract class Temporal with four subclasses TemporalInst, TemporalI, TemporalSeq, and TemporalS. Indeed, the algorithms for manipulating these 4 durations are completely different. They are called template classes since they keep the Oid of the base type (float for tfloat or geometry for tgeompoint), in the same way arrays or ranges do. For more information please refer to the manual at github https://github.com/ULB-CoDE-WIT/MobilityDB/ > I don't think we're strongly against changing the code to allow this, as > long as it does not break existing extensions/code (unnecessarily). > >If you want I can prepare a PR in order to understand the implications of > >these changes. Please let me know. > > I think having an actual patch to look at would be helpful. I am preparing a first patch for the files selfuncs.h and selfuncs.c, and thus for instant duration selectivity. It basically 1) moves some prototypes of the static functions from the .c to the .h file so that the functions are exported, and 2) passes the operator from the top-level functions to the inner functions such as mcv_selectivity or ineq_histogram_selectivity. This allows me to call the functions twice, once for the value component and once for the time component, e.g. as follows.

else if (cachedOp == CONTAINED_OP || cachedOp == OVERLAPS_OP)
{
	/* Combine the selectivities of the value and time dimensions,
	 * since either may be missing */
	double		selec_value = 1.0,
				selec_time = 1.0;

	/* Selectivity for the value dimension */
	if (MOBDB_FLAGS_GET_X(box->flags))
	{
		operator = oper_oid(LT_OP, valuetypid, valuetypid);
		selec_value = scalarineqsel(root, operator, false, false, vardata,
									Float8GetDatum(box->xmin), valuetypid);
		operator = oper_oid(GT_OP, valuetypid, valuetypid);
		selec_value += scalarineqsel(root, operator, true, false, vardata,
									 Float8GetDatum(box->xmax), valuetypid);
		selec_value = 1 - selec_value;
	}

	/* Selectivity for the time dimension */
	if (MOBDB_FLAGS_GET_T(box->flags))
	{
		operator = oper_oid(LT_OP, T_TIMESTAMPTZ, T_TIMESTAMPTZ);
		selec_time = scalarineqsel(root, operator, false, false, vardata,
								   TimestampTzGetDatum(box->tmin),
								   TIMESTAMPTZOID);
		operator = oper_oid(GT_OP, T_TIMESTAMPTZ, T_TIMESTAMPTZ);
		selec_time += scalarineqsel(root, operator, true, false, vardata,
									TimestampTzGetDatum(box->tmax),
									TIMESTAMPTZOID);
		selec_time = 1 - selec_time;
	}

	selec = selec_value * selec_time;
}

Regards Esteban
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Thu, Sep 12, 2019 at 1:56 PM Amit Kapila wrote: > > On Wed, Sep 11, 2019 at 8:53 PM Masahiko Sawada wrote: > > > > I've attached the updated patch that incorporated all comments. I kept > > the function as superuser-restricted. > > > > Thanks for the updated patch. > > Few more comments: Thank you for your comments. > > * > + if (!superuser()) > + ereport(ERROR, > + (errcode > (ERRCODE_INSUFFICIENT_PRIVILEGE), > + errmsg("must be superuser to use raw > page functions"))); > > I think it is better to use a message like "must be superuser to use > pageinspect functions" as this function doesn't take raw page as > input. If you see other functions like bt_page_items which doesn't > take raw page as input has the message which I am suggesting. I can > see that tuple_data_split also has a similar message as you are > proposing, but I think that is also not very appropriate. Agreed. Will fix. > > * > else > + { > + if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)) > + d[cnt++] = CStringGetTextDatum > ("HEAP_XMAX_LOCK_ONLY"); > > If the idea is that whenever decode_combined flag is false, we will > display the raw flags set on the tuple, then why to try to interpret > flags on a tuple in the above case. Hmm my understanding of 'decode_combined' is to decode the flags that we represent by using multiple flags. HEAP_XMAX_IS_LOCKED_ONLY is true either if HEAP_XMAX_LOCK_ONLY is set or not a multi and the EXCL_LOCK bit is set. That is it requires only one flag. So I thought that it's not a combined flag. > > * > + if (decode_combined && > + HEAP_LOCKED_UPGRADED(t_infomask)) > + d[cnt++] = CStringGetTextDatum("HEAP_LOCKED_UPGRADED"); > + else > + { > + if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)) > + d[cnt++] = CStringGetTextDatum > ("HEAP_XMAX_LOCK_ONLY"); > + if ((t_infomask & HEAP_XMAX_IS_MULTI) != 0) > + d[cnt++] = CStringGetTextDatum > ("HEAP_XMAX_IS_MULTI"); > + } > > I am not completely sure whether we want to cover HEAP_LOCKED_UPGRADED > case even when decode_combined flag is set. It seems this is a bit > more interpretation of flags than we are doing in other cases. For > example, other cases like HEAP_XMAX_SHR_LOCK or HEAP_XMIN_FROZEN are > the flags that are explicitly set on the tuple so displaying them > makes sense, but the same is not true for HEAP_LOCKED_UPGRADED. > I thought it would be better to interpret it as much as possible, especially for diagnostic use cases. I'm concerned that user might not be able to get enough information for investigation if we intentionally filtered particular flags. > * > +CREATE FUNCTION heap_tuple_infomask_flags( > + t_infomask integer, > + t_infomask2 integer, > + decode_combined boolean DEFAULT false) > > I am not very happy with the parameter name 'decode_combined'. It is > not clear from the name what it means and I think it doesn't even > match with what we are actually doing here. How about raw_flags, > raw_tuple_flags or something like that? > raw_flags might be more straightforward. Or perhaps the third idea could be show_raw_flags? If other hackers agree to change the flag name I'll fix it. I'll submit the patch to fix the commit after we got a consensus on the above changes. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
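For reference, the macro whose interpretation is being debated, paraphrased from htup_details.h (check the header for the authoritative definition):

    /* True if xmax is only a locker, not an updater: either the
     * explicit LOCK_ONLY bit is set, or it is not a MultiXact and only
     * the pre-9.3-style EXCL_LOCK bit is present. */
    #define HEAP_XMAX_IS_LOCKED_ONLY(infomask) \
        (((infomask) & HEAP_XMAX_LOCK_ONLY) != 0 || \
         ((infomask) & (HEAP_XMAX_IS_MULTI | HEAP_XMAX_EXCL_LOCK)) == \
         HEAP_XMAX_EXCL_LOCK)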
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Hello. At Wed, 11 Sep 2019 17:26:44 +0300, Anastasia Lubennikova wrote in > 10.09.2019 14:42, Asim R P wrote: > > Hi Anastasia > > > > On Thu, Aug 22, 2019 at 9:43 PM Anastasia Lubennikova > > mailto:a.lubennik...@postgrespro.ru>> > > wrote: > > > > > > But during the review, I found a bug in the current implementation. > > > New behavior must apply to crash-recovery only, now it applies to > > > archiveRecovery too. > > > That can cause a silent loss of a tablespace during regular standby > > > operation > > > since it never calls CheckRecoveryConsistency(). Yeah. We should take the same steps with redo operations on missing pages. Just record failure during inconsistent state then forget it if underlying tablespace is gone. If we had a record when we reached consistency, we're in a serious situation and should stop recovery. log_invalid_page forget_invalid_pages and CheckRecoveryConsistency are the entry points of the feature to understand. > > > Steps to reproduce: > > > 1) run master and replica > > > 2) create dir for tablespace: > > > mkdir /tmp/tblspc1 > > > > > > 3) create tablespace and database on the master: > > > create tablespace tblspc1 location '/tmp/tblspc1'; > > > create database db1 tablespace tblspc1 ; > > > > > > 4) wait for replica to receive this changes and pause replication: > > > select pg_wal_replay_pause(); > > > > > > 5) move replica's tablespace symlink to some empty directory, > > > i.e. /tmp/tblspc2 > > > mkdir /tmp/tblspc2 > > > ln -sfn /tmp/tblspc2 postgresql_data_replica/pg_tblspc/16384 > > > > > > > By changing the tablespace symlink target, we are silently nullifying > > effects of a committed transaction from the standby data directory - > > the directory structure created by the standby for create tablespace > > transaction. This step, therefore, does not look like a valid test > > case to me. Can you share a sequence of steps that does not involve > > changing data directory manually? I see it as the same. WAL is inconsistent with what happened on storage with the steps. Database is just broken. > Hi, the whole idea of the test is to reproduce a data loss. For > example, if the disk containing this tablespace failed. So, apparently we must start recovery from a backup before that failure happened in that case, and that should end in success. # I remember that the start point of this patch is a crash after # table space drop subsequent to several operations within the # table space. Then, crash recovery fails at an operation in the # finally-removed tablespace. Is it right? > Probably, simply deleting the directory > 'postgresql_data_replica/pg_tblspc/16384' > would work as well, though I was afraid that it can be caught by some > earlier checks and my example won't be so illustrative. > > > > Thank you for the review feedback. I agree with all the points. Let > > me incorporate them (I plan to pick this work up and drive it to > > completion as Paul got busy with other things). > > > > But before that I'm revisiting another solution upthread, that of > > creating restart points when replaying create/drop database commands > > before making filesystem changes such as removing a directory. The > > restart points should align with checkpoints on master. The concern > > against this solution was creation of restart points will slow down > > recovery. I don't think crash recovery is affected by this solution > > because of the already existing enforcement of checkpoints.
WAL > > records prior to a create/drop database will not be seen by crash > > recovery due to the checkpoint enforced during the command's normal > > execution. > > > > I haven't measured the impact of generating extra restart points in > previous solution, so I cannot tell whether concerns upthread are > justified. Still, I enjoy latest design more, since it is clear and > similar with the code of checking unexpected uninitialized pages. In > principle it works. And the issue, I described in previous review can > be easily fixed by several additional checks of InHotStandby macro. Generally we shouldn't trigger useless restart points for uncertain reasons. If standby crashes, it starts the next recovery from the latest *restart point*. Even in that case what we should do is the same. Of course, for testing, we *should* establish a restartpoint manually in order to establish the prerequisite state. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, Sep 11, 2019 at 09:51:40AM -0300, Alvaro Herrera from 2ndQuadrant wrote: On 2019-Sep-11, Amit Khandekar wrote: I reproduced the error "exceeded maxAllocatedDescs (492) while trying to open file ...", which was also discussed about in the thread [1]. This issue is similar but not exactly the same as [1]. In [1], the file for which this error used to show up was "pg_logical/mappings/map" , while here it's the .spill file. And here the issue , in short, seems to be : The .spill file does not get closed there and then, unlike in [1] where there was a file descriptor leak. Uh-oh :-( Thanks for the reproducer -- I confirm it breaks things. Looking at the code, what might be happening is, ReorderBufferIterTXNInit()=>ReorderBufferRestoreChanges() opens the files, but leaves them open if end of file is not reached. Eventually if end of file is reached, it gets closed. The function returns back without closing the file descriptor if reorder buffer changes being restored are more than max_changes_in_memory. Probably later on, the rest of the changes get restored in another ReorderBufferRestoreChanges() call. But meanwhile, if there are a lot of such files opened, we can run out of the max files that PG decides to keep open (it has some logic that takes into account system files allowed to be open, and already opened). Makes sense. Offhand, what I am thinking is, we need to close the file descriptor before returning from ReorderBufferRestoreChanges(), and keep track of the file offset and file path, so that next time we can resume reading from there. I think doing this all the time would make restore very slow -- there's a reason we keep the files open, after all. How much slower? It certainly will have a hit, but maybe it's negligible compared to all the other stuff happening in this code? It would be better if we can keep the descriptors open as much as possible, and only close them if there's trouble. I was under the impression that using OpenTransientFile was already taking care of that, but that's evidently not the case. I don't see how the current API could do that transparently - it does track the files, but the user only gets a file descriptor. With just a file descriptor, how could the code know to do reopen/seek when it's going just through the regular fopen/fclose? Anyway, I agree we need to do something, to fix this corner case (many serialized in-progress transactions). ISTM we have two options - either do something in the context of reorderbuffer.c, or extend the transient file API somehow. I'd say the second option is the right thing going forward, because it does allow doing it transparently and without leaking details about maxAllocatedDescs etc. There are two issues, though - it does require changes / extensions to the API, and it's not backpatchable. So maybe we should start with the localized fix in reorderbuffer, and I agree tracking offset seems reasonable. As a sidenote - in the other thread about streaming, one of the patches does change how we log subxact assignments. In the end, this allows using just a single file for the top-level transaction, instead of having one file per subxact. That would also solve this. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Thu, Sep 12, 2019 at 11:43 AM Michael Paquier wrote: > > On Wed, Sep 11, 2019 at 11:22:45PM +0800, Masahiko Sawada wrote: > > Hmm it will be more consistent with other functions but I think we > > would need to increase the pageinspect version to 1.8 and need the new > > sql file to rename the function name. And it will be for PG12, not > > PG13. If we have to do it someday I think it's better to do it in PG12 > > that the table AM has been introduced to. Anyway I've attached > > separate patch for it. > > Like Alvaro, I would discard this one for now. > > > I've attached the updated patch that incorporated all comments. I kept > > the function as superuser-restricted. > > But not this one. So committed. > I had a few comments, as posted in the previous email, which I think we can address incrementally as patches for them are produced. However, one point about which I am slightly worried is the last one in my email. Are we happy with the name of the new API parameter, decode_combined? If we decide to change that, then we need to change the exposed API, and ideally the extension version as well (a sketch of what such an upgrade script could look like follows below); but I might be wrong, and maybe the parameter name as committed is good enough, in which case we should be fine. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
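For illustration of what changing the exposed API would involve: CREATE OR REPLACE FUNCTION cannot rename an input parameter, so a rename would need a version bump plus an upgrade script along these lines. This is a hedged sketch only; the raw_flags name and the text[] return type are assumptions for illustration, not the committed definition:

    -- hypothetical pageinspect--1.7--1.8.sql: drop and recreate the
    -- function to rename its third parameter
    DROP FUNCTION heap_tuple_infomask_flags(integer, integer, boolean);
    CREATE FUNCTION heap_tuple_infomask_flags(
        t_infomask integer,
        t_infomask2 integer,
        raw_flags boolean DEFAULT false)   -- renamed parameter (illustrative)
    RETURNS text[]
    AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
    LANGUAGE C STRICT PARALLEL SAFE;

Note that if the boolean were renamed to something like raw_flags, its sense would presumably be inverted relative to decode_combined, so the default would need care.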
Re: psql - improve test coverage from 41% to 88%
Ok. Rebased version added, with some minor changes to improve readability (comments, variables).

Few comments: [...]
- Commented line can be removed
- Commented lines can be removed
- ??? can be changed to some suitable heading
- tab-complation to be changed to tab-completion
- Commented lines can be removed, some more are present below these lines also.

Thanks for this review. The lines were really tests I did that had some issues because of the way the Expect module works, and are not useful for inclusion in the code base. Here is a v5. -- Fabien Coelho - CRI, MINES ParisTech

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl index b7d36b65dd..aabc27ab6f 100644 --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl @@ -506,6 +506,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt" ], 1, + '', [qr{^$}], [qr/^WARNING.*checksum verification failed/s], 'pg_basebackup reports checksum mismatch'); @@ -526,6 +527,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt2" ], 1, + '', [qr{^$}], [qr/^WARNING.*further.*failures.*will.not.be.reported/s], 'pg_basebackup does not report more than 5 checksum mismatches'); @@ -542,6 +544,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt3" ], 1, + '', [qr{^$}], [qr/^WARNING.*7 total checksum verification failures/s], 'pg_basebackup correctly report the total number of checksum mismatches'); diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl index 59228b916c..cf4811d382 100644 --- a/src/bin/pg_checksums/t/002_actions.pl +++ b/src/bin/pg_checksums/t/002_actions.pl @@ -62,6 +62,7 @@ sub check_relation_corruption '--filenode', $relfilenode_corrupted ], 1, + '', [qr/Bad checksums:.*1/], [qr/checksum verification failed/], "fails with corrupted data for single relfilenode on tablespace $tablespace" @@ -71,6 +72,7 @@ sub check_relation_corruption $node->command_checks_all( [ 'pg_checksums', '--check', '-D', $pgdata ], 1, + '', [qr/Bad checksums:.*1/], [qr/checksum verification failed/], "fails with corrupted data on tablespace $tablespace"); @@ -203,6 +205,7 @@ sub fail_corrupt $node->command_checks_all( [ 'pg_checksums', '--check', '-D', $pgdata ], 1, + '', [qr/^$/], [qr/could not read block 0 in file.*$file\":/], "fails for corrupted data in $file"); diff --git a/src/bin/pg_controldata/t/001_pg_controldata.pl b/src/bin/pg_controldata/t/001_pg_controldata.pl index 3b63ad230f..ebe9b80a52 100644 --- a/src/bin/pg_controldata/t/001_pg_controldata.pl +++ b/src/bin/pg_controldata/t/001_pg_controldata.pl @@ -33,6 +33,7 @@ close $fh; command_checks_all( [ 'pg_controldata', $node->data_dir ], 0, + '', [ qr/WARNING: Calculated CRC checksum does not match value stored in file/, qr/WARNING: invalid WAL segment size/ diff --git a/src/bin/pg_resetwal/t/002_corrupted.pl b/src/bin/pg_resetwal/t/002_corrupted.pl index f9940d7fc5..1990669d26 100644 --- a/src/bin/pg_resetwal/t/002_corrupted.pl +++ b/src/bin/pg_resetwal/t/002_corrupted.pl @@ -30,6 +30,7 @@ close $fh; command_checks_all( [ 'pg_resetwal', '-n', $node->data_dir ], 0, + '', [qr/pg_control version number/], [ qr/pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it/ @@ -46,6 +47,7 @@ close $fh; command_checks_all( [ 'pg_resetwal',
'-n', $node->data_dir ], 0, + '', [qr/pg_control version number/], [ qr/\Qpg_resetwal: warning: pg_control specifies invalid WAL segment size (0 bytes); proceed with caution\E/ diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl index b82d3f65c4..01010914fe 100644 --- a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -50,7 +50,7 @@ sub pgbench push @cmd, @args; - $node->command_checks_all(\@cmd, $stat, $out, $err, $name); + $node->command_checks_all(\@cmd, $stat, '', $out, $err, $name); # cleanup? #unlink @filenames or die "cannot unlink files (@filenames): $!"; diff --git a/src/bin/pgbench/t/002_pgbench_no_server.pl b/src/bin/pgbench/t/002_pgbench_no_server.pl index f7fa18418b..b58f3548c3 100644 --- a/src/bin/pgbench/t/002_pgbench_no_server.pl +++ b/src/bin/pgbench/t/002_pgbench_no_server.pl @@ -25,7 +25,7 @@ sub pgbench my ($opts, $stat, $out, $err, $name) = @_; print STDERR "opts=$opts, stat=$stat, out=$out, err=$err, name=$name"; command_checks_all([ 'pgbench', split(/\s+/, $opts) ], - $stat, $out, $err, $name); + $stat, '', $out, $err, $name); return; } @@ -52,7 +52,7 @@ sub pgbench_scripts push @cmd, '-f', $filename; } } -
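As a side note on the API change threaded through these diffs: the recurring new '' argument appears to be the text fed to the command's standard input, judging by how the psql() helper passes its script string in the same position. A hedged usage sketch follows; the argument order is inferred from the diff, not taken from the committed API:

    # assumes a running PostgresNode in $node, as in the TAP tests above
    $node->command_checks_all(
        [ 'psql', '-X' ],
        0,                            # expected exit status
        "SELECT 17 AS seventeen;\n",  # stdin for the command (the new argument)
        [ qr/seventeen/, qr/17/ ],    # expected stdout patterns
        [ qr/^$/ ],                   # expected stderr patterns
        'psql runs a script fed on stdin');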
Re: psql - improve test coverage from 41% to 88%
On Thu, Sep 12, 2019 at 11:56 AM Fabien COELHO wrote: > > > On Wed, Sep 11, 2019 at 10:52:01PM +0200, Fabien COELHO wrote: > >> AFAICR this is because the coverage was not the same:-) Some backslash > >> commands just skip silently to the end of the line, so that intermediate > >> \commands on the same line are not recognized/processed the same, so I > >> moved > >> everything on one line to avoid this. > > > > I see. So basically this tests for more code paths to ignore > > backslash commands and it improves the coverage of \elif. Applied > > after fixing two nits: > > - Indentation was incorrect. > > - Moved the \elif test closer to the existing one for the false > > branch (you can grep #2 to find it). > > Ok. Rebased version added, with some minor changes to improve readability > (comments, variables). > > Few comments: +sub create_test_file +{ + my ($fname, $contents) = @_; + my $fn = $node->basedir . '/' . $fname; + #ok(not -e $fn, "$fn must not already exists"); + append_to_file($fn, $contents); + return $fn; +} Commented line can be removed +# nope, interacts on tty +#psql('-W', 0, "foo\n", [ qr{^$} ], [ qr{^$} ], 'psql -W'); +psql('-x', 0, "SELECT 1 AS one, 2 AS two;\n", [ qr{one \| 1.*two \| 2}s ], $EMPTY, 'psql -x'); +# some issue, \0 is not welcome somewhere +#psql('-A -z', "SELECT 1 AS one, 2 AS two;\n", [ qr{one.two}s, qr{1.2}s ], $EMPTY, 'psql -z'); +#psql('-A -0', "SELECT 1 AS one, 2 AS two;\n", [ qr{two.1}s ], $EMPTY, 'psql -0'); +psql('-1', 0, "SELECT 54;\nSELECT 32;\n", [ qr{54}, qr{32} ], $EMPTY, 'psql -1'); Commented lines can be removed + [ "\\lo_list\n", [ qr{Large objects} ] ], + [ "\\if true\\q\\endif\n", $EMPTY ], + # ??? + #[ "SELECT md5('hello world');\n\\s\n", [ qr{5eb63bbbe0}, qr{SELECT md5} ] ], + [ "\\set\n", [ qr{ENCODING = }, qr{VERSION_NUM = } ] ], + [ "\\set COMP_KEYWORD_CASE preserve-lower\n\\set COMP_KEYWORD_CASE lower\n" . #[ "Select"] commented line can be removed ??? can be changed to some suitable heading +psql('', 0, "\\s /dev/null\n", $EMPTY, $EMPTY, 'psql \s null'); + +# tab-complation +ipsql('-P pager', 0, 5, + [ # commands tab-complation to be changed to tab-completion + # but the coverage works as expected. + #[ "CREATE \t", qr/i(MATERIALIZED VIEW.*postgres=\# )?/s ], + #[ "\\r\n", qr/Query buffer reset.*postgres=\# /s ], + [ "CREATE \t\\r\n", qr/Query buffer reset.*postgres=\# /s ], + #[ "DROP \t", qr/(UNLOGGED.*postgres=\# )?/s ], + #[ "\\r\n", qr/Query buffer reset.*postgres=\# /s ], + [ "DROP \t\\r\n", qr/Query buffer reset.*postgres=\# /s ], + #[ "ALTER \t", qr/(TABLESPACE.*postgres=\# )/s ], + #[ "\\r\n", qr/Query buffer reset.*postgres=\# /s ], Commented lines can be removed, some more are present below these lines also. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: psql - improve test coverage from 41% to 88%
On Wed, Sep 11, 2019 at 10:52:01PM +0200, Fabien COELHO wrote: AFAICR this is because the coverage was not the same:-) Some backslash commands just skip silently to the end of the line, so that intermediate \commands on the same line are not recognized/processed the same, so I moved everything on one line to avoid this. I see. So basically this tests for more code paths to ignore backslash commands and it improves the coverage of \elif. Applied after fixing two nits:
- Indentation was incorrect.
- Moved the \elif test closer to the existing one for the false branch (you can grep #2 to find it).
Ok. Rebased version added, with some minor changes to improve readability (comments, variables). -- Fabien.

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl index b7d36b65dd..aabc27ab6f 100644 --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl @@ -506,6 +506,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt" ], 1, + '', [qr{^$}], [qr/^WARNING.*checksum verification failed/s], 'pg_basebackup reports checksum mismatch'); @@ -526,6 +527,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt2" ], 1, + '', [qr{^$}], [qr/^WARNING.*further.*failures.*will.not.be.reported/s], 'pg_basebackup does not report more than 5 checksum mismatches'); @@ -542,6 +544,7 @@ system_or_bail 'pg_ctl', '-D', $pgdata, 'start'; $node->command_checks_all( [ 'pg_basebackup', '-D', "$tempdir/backup_corrupt3" ], 1, + '', [qr{^$}], [qr/^WARNING.*7 total checksum verification failures/s], 'pg_basebackup correctly report the total number of checksum mismatches'); diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl index 59228b916c..cf4811d382 100644 --- a/src/bin/pg_checksums/t/002_actions.pl +++ b/src/bin/pg_checksums/t/002_actions.pl @@ -62,6 +62,7 @@ sub check_relation_corruption '--filenode', $relfilenode_corrupted ], 1, + '', [qr/Bad checksums:.*1/], [qr/checksum verification failed/], "fails with corrupted data for single relfilenode on tablespace $tablespace" @@ -71,6 +72,7 @@ sub check_relation_corruption $node->command_checks_all( [ 'pg_checksums', '--check', '-D', $pgdata ], 1, + '', [qr/Bad checksums:.*1/], [qr/checksum verification failed/], "fails with corrupted data on tablespace $tablespace"); @@ -203,6 +205,7 @@ sub fail_corrupt $node->command_checks_all( [ 'pg_checksums', '--check', '-D', $pgdata ], 1, + '', [qr/^$/], [qr/could not read block 0 in file.*$file\":/], "fails for corrupted data in $file"); diff --git a/src/bin/pg_controldata/t/001_pg_controldata.pl b/src/bin/pg_controldata/t/001_pg_controldata.pl index 3b63ad230f..ebe9b80a52 100644 --- a/src/bin/pg_controldata/t/001_pg_controldata.pl +++ b/src/bin/pg_controldata/t/001_pg_controldata.pl @@ -33,6 +33,7 @@ close $fh; command_checks_all( [ 'pg_controldata', $node->data_dir ], 0, + '', [ qr/WARNING: Calculated CRC checksum does not match value stored in file/, qr/WARNING: invalid WAL segment size/ diff --git a/src/bin/pg_resetwal/t/002_corrupted.pl b/src/bin/pg_resetwal/t/002_corrupted.pl index f9940d7fc5..1990669d26 100644 --- a/src/bin/pg_resetwal/t/002_corrupted.pl +++ b/src/bin/pg_resetwal/t/002_corrupted.pl @@ -30,6 +30,7 @@ close $fh; command_checks_all( [ 'pg_resetwal', '-n', $node->data_dir ], 0, + '', [qr/pg_control version number/], [ qr/pg_resetwal:
warning: pg_control exists but is broken or wrong version; ignoring it/ @@ -46,6 +47,7 @@ close $fh; command_checks_all( [ 'pg_resetwal', '-n', $node->data_dir ], 0, + '', [qr/pg_control version number/], [ qr/\Qpg_resetwal: warning: pg_control specifies invalid WAL segment size (0 bytes); proceed with caution\E/ diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl index b82d3f65c4..01010914fe 100644 --- a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -50,7 +50,7 @@ sub pgbench push @cmd, @args; - $node->command_checks_all(\@cmd, $stat, $out, $err, $name); + $node->command_checks_all(\@cmd, $stat, '', $out, $err, $name); # cleanup? #unlink @filenames or die "cannot unlink files (@filenames): $!"; diff --git a/src/bin/pgbench/t/002_pgbench_no_server.pl b/src/bin/pgbench/t/002_pgbench_no_server.pl index f7fa18418b..b58f3548c3 100644 --- a/src/bin/pgbench/t/002_pgbench_no_server.pl +++ b/src/bin/pgbench/t/002_pgbench_no_server.pl @@ -25,7 +25,7 @@ sub pgbench my ($opts, $stat, $out, $err, $name) = @_; print STDERR "opts=$opts, stat=$stat, out=$out, err=$err, name=$name"; command_checks_all([ 'pgbench', split(/\s+/, $opts) ], - $stat, $out, $err, $name); + $stat, '',
Parallel Full Hash Join
Hello, While thinking about looping hash joins (an alternative strategy for limiting hash join memory usage currently being investigated by Melanie Plageman in a nearby thread[1]), the topic of parallel query deadlock hazards came back to haunt me. I wanted to illustrate the problems I'm aware of with the concrete code where I ran into this stuff, so here is a new-but-still-broken implementation of $SUBJECT. This was removed from the original PHJ submission when I got stuck and ran out of time in the release cycle for 11. Since the original discussion is buried in long threads and some of it was also a bit confused, here's a fresh description of the problems as I see them. Hopefully these thoughts might help Melanie's project move forward, because it's closely related, but I didn't want to dump another patch into that other thread. Hence this new thread. I haven't succeeded in actually observing a deadlock with the attached patch (though I did last year, very rarely), but I also haven't tried very hard. The patch seems to produce the right answers and is pretty scalable, so it's really frustrating not to be able to get it over the line. Tuple queue deadlock hazard: If the leader process is executing the subplan itself and waiting for all processes to arrive in ExecParallelHashEndProbe() (in this patch), while another process has filled up its tuple queue and is waiting for the leader to read some tuples and unblock it, they will deadlock forever. That can't happen in the committed version of PHJ, because it never waits for barriers after it has begun emitting tuples. Some possible ways to fix this: 1. You could probably make it so that the PHJ_BATCH_SCAN_INNER phase in this patch (the scan for unmatched tuples) is executed by only one process, using the "detach-and-see-if-you-were-last" trick (a minimal sketch follows at the end of this message). Melanie proposed that for an equivalent problem in the looping hash join. I think it probably works, but it gives up a lot of parallelism and thus won't scale as nicely as the attached patch. 2. You could probably make it so that only the leader process drops out of executing the inner unmatched scan, and then I think you wouldn't have this very specific problem at the cost of losing some (but not all) parallelism (ie the leader), but there might be other variants of the problem. For example, a GatherMerge leader process might be blocked waiting for the next tuple from P1, while P2 is trying to write to a full queue, and P1 waits for P2. 3. You could introduce some kind of overflow for tuple queues, so that tuple queues can never block because they're full (until you run out of extra memory buffers or disk and error out). I haven't seriously looked into this but I'm starting to suspect it's the industrial-strength general solution to the problem and variants of it that show up in other parallelism projects (Parallel Repartition). As Robert mentioned last time I talked about this[2], you'd probably only want to allow spooling (rather than waiting) when the leader is actually waiting for other processes; I'm not sure how exactly to control that. 4. Goetz Graefe's writing about parallel sorting comes close to this topic, which he calls flow control deadlocks. He mentions the possibility of infinite spooling like (3) as a solution. He's describing a world where producers and consumers are running concurrently, and the consumer doesn't just decide to start running the subplan (what we call "leader participation"), so he doesn't actually have a problem like Gather deadlock.
He describes planner-enforced rules that allow deadlock-free execution even with fixed-size tuple queue flow control by carefully controlling where order-forcing operators are allowed to appear, so he doesn't have a problem like Gather Merge deadlock. I'm not proposing we should create a whole bunch of producer and consumer processes to run different plan fragments, but I think you can virtualise the general idea in an async executor with "streams", and that also solves other problems when you start working with partitions in a world where it's not even sure how many workers will show up. I see this as a long-term architectural goal requiring vast amounts of energy to achieve, hence my new interest in (3) for now. Hypothetical inter-node deadlock hazard: Right now I think it is the case that whenever any node begins pulling tuples from a subplan, it continues to do so until either the query ends early or the subplan runs out of tuples. For example, Append processes its subplans one at a time until they're done -- it doesn't jump back and forth. Parallel Append doesn't necessarily run them in the order that they appear in the plan, but it still runs each one to completion before picking another one. If we ever had a node that didn't adhere to that rule, then two Parallel Full Hash Join nodes could deadlock, if some of the workers were stuck waiting in one while some were stuck waiting in the other. If we were
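To make the "detach-and-see-if-you-were-last" trick from option (1) concrete, here is a minimal sketch using the existing barrier API. The struct and function names are hypothetical, not from the attached patch:

    #include "postgres.h"

    #include "storage/barrier.h"

    /* Hypothetical batch state; only the barrier matters for this sketch. */
    typedef struct PhjBatchSketch
    {
        Barrier     batch_barrier;
    } PhjBatchSketch;

    /*
     * Returns true in exactly one backend: the last participant to
     * detach.  That backend can safely run the serial unmatched-inner
     * scan, and it cannot be blocking anyone, because everybody else
     * has already detached and moved on to emitting or reading tuples.
     */
    static bool
    phj_elect_unmatched_scanner(PhjBatchSketch *batch)
    {
        return BarrierArriveAndDetach(&batch->batch_barrier);
    }

The trade-off, as noted in option (1), is that the unmatched scan itself then runs serially in the elected backend.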
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Wed, Sep 11, 2019 at 11:22:45PM +0800, Masahiko Sawada wrote: > Hmm it will be more consistent with other functions but I think we > would need to increase the pageinspect version to 1.8 and need the new > sql file to rename the function name. And it will be for PG12, not > PG13. If we have to do it someday I think it's better to do it in PG12 > that the table AM has been introduced to. Anyway I've attached > separate patch for it. Like Alvaro, I would discard this one for now. > I've attached the updated patch that incorporated all comments. I kept > the function as superuser-restricted. But not this one. So committed. I have gone through the patch and adjusted a couple of things: the tests, some weird formulations in the docs, and an example that led mainly to NULLs being returned when scanning the first page of pg_class. The tests needed some improvements to gain in clarity (no need for unnest with 2 elements, added tests for all the combined flags, etc.). The patch was not indented either, but this is no big deal. I hope I did not forget to credit anybody in the commit message. If I did, you are the winner of a drink of your choice the next time we meet. -- Michael
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On Wed, Sep 11, 2019 at 8:53 PM Masahiko Sawada wrote: > > I've attached the updated patch that incorporated all comments. I kept > the function as superuser-restricted. > Thanks for the updated patch. Few more comments: * + if (!superuser()) + ereport(ERROR, + (errcode (ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("must be superuser to use raw page functions"))); I think it is better to use a message like "must be superuser to use pageinspect functions" as this function doesn't take a raw page as input. Other functions like bt_page_items, which don't take a raw page as input, use the message I am suggesting. I can see that tuple_data_split also has a message similar to what you are proposing, but I think that is also not very appropriate. * else + { + if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)) + d[cnt++] = CStringGetTextDatum ("HEAP_XMAX_LOCK_ONLY"); If the idea is that whenever the decode_combined flag is false, we will display the raw flags set on the tuple, then why try to interpret the flags on a tuple in the above case? * + if (decode_combined && + HEAP_LOCKED_UPGRADED(t_infomask)) + d[cnt++] = CStringGetTextDatum("HEAP_LOCKED_UPGRADED"); + else + { + if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)) + d[cnt++] = CStringGetTextDatum ("HEAP_XMAX_LOCK_ONLY"); + if ((t_infomask & HEAP_XMAX_IS_MULTI) != 0) + d[cnt++] = CStringGetTextDatum ("HEAP_XMAX_IS_MULTI"); + } I am not completely sure whether we want to cover the HEAP_LOCKED_UPGRADED case even when the decode_combined flag is set. It seems this is a bit more interpretation of flags than we are doing in other cases. For example, other cases like HEAP_XMAX_SHR_LOCK or HEAP_XMIN_FROZEN are flags that are explicitly set on the tuple, so displaying them makes sense, but the same is not true for HEAP_LOCKED_UPGRADED. * +CREATE FUNCTION heap_tuple_infomask_flags( + t_infomask integer, + t_infomask2 integer, + decode_combined boolean DEFAULT false) I am not very happy with the parameter name 'decode_combined'. It is not clear from the name what it means, and I think it doesn't even match what we are actually doing here. How about raw_flags, raw_tuple_flags or something like that? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
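For reference on why HEAP_LOCKED_UPGRADED is more than a simple combination of bits, here is a paraphrase of the htup_details.h definition (from memory, so treat it as approximate rather than authoritative):

    /*
     * Approximate paraphrase: the tuple was locked by a pre-9.3
     * ("upgraded") multixact iff xmax is a multixact, the xmax is
     * lock-only, and none of the specific lock-strength bits are set.
     * This is why it cannot be decoded as a plain OR of raw flags.
     */
    #define HEAP_LOCKED_UPGRADED(infomask) \
    ( \
        ((infomask) & HEAP_XMAX_IS_MULTI) != 0 && \
        ((infomask) & HEAP_XMAX_LOCK_ONLY) != 0 && \
        (((infomask) & (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)) == 0) \
    )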