Re: [HACKERS] amcheck (B-Tree integrity checking tool)
On Fri, Feb 10, 2017 at 6:45 AM, Peter Geoghegan wrote: > On Thu, Feb 9, 2017 at 2:47 PM, Peter Geoghegan wrote: >>> which isn't an issue here, but reinforces my point about the (badly >>> documented) assumption that we don't release locks on user relations >>> early. >> I think even if we don't need to retain locks till transaction end in the present case, but Andres's point has a merit that it is good to follow an undocumented rule of releasing locks at transaction end for user relations unless there is some benefit in releasing the locks early in some situation. I think that can avoid unnecessary worry that whether it safe to do in some situation. >> You are right about the substantive issue, I assume, but I have a hard >> time agreeing with the idea that it's even badly documented when there >> are at least 10 counter-examples of blithely doing this. In any case, >> I find it very confusing when you talk about things as established >> practice/coding standards when they're very clearly not consistently >> adhered to at all. You should at least say "let's not make a bad >> situation any worse", or something, so that I don't need to spend 10 >> minutes pulling my hair out. > > BTW, aren't there cases where a PARALLEL SAFE function that needs to > acquire locks on some arbitrary relation not referenced in the query > can get locks on its own, which may only last as long as the parallel > worker's lifetime? This could happen *despite* the fact that the > author of the function may have imagined that callers could not > release relation level locks early (i.e., by not releasing them > directly, and so correctly following this "badly documented > assumption"). > > It seems like the existence of PARALLEL RESTRICTED is primarily > motivated by making stuff that definitely needs the locks to last > until xact end being sure that that happens -- the docs say so. > I don't think so. I think one of the primary reason of releasing locks in workers is that we don't have any way to transfer them to the leader which can release them at transaction end. > This > seems wholly inconsistent with the idea that you're not supposed to > let that happen under any circumstances. I find all this highly > confusing. Have I missed some further subtlety that applies with > PARALLEL RESTRICTED? > The basic idea of parallel restricted is that any expression will be considered parallel restricted if that can't be executed in the worker, for example, a reference to initplans in a query. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Index Scans
On Fri, Feb 10, 2017 at 11:27 PM, Robert Haas wrote: > On Wed, Feb 1, 2017 at 8:20 AM, Amit Kapila wrote: >>> The hunk in indexam.c looks like a leftover that can be omitted. >> >> It is not a leftover hunk. Earlier, the patch has the same check >> btparallelrescan, but based on your comment up thread [1] (Why does >> btparallelrescan cater to the case where scan->parallel_scan== NULL?), >> this has been moved to indexam.c. > > That's not my point. The point is, if it's not a parallel scan, > index_parallelrescan() should never get called in the first place. So > therefore it shouldn't need to worry about scan->parallel_scan being > NULL. It seems possibly reasonable to put something like > Assert(scan->parallel_scan != NULL) in there, but arranging to return > without doing anything in that case seems like it just masks possible > bugs in the calling code. And really I'm not sure we even need the > Assert. > I have removed the check from index_parallelrescan() and ensured in the caller that it must be called only when parallel_scan descriptor is present. Comments from another e-mail: > This design presupposes that every AM that might ever want to do > parallel scans is happy with genericcostestimate()'s method of > computing the number of index pages that will be fetched. However, > that might not be true for every possible AM. In fact, it's already > not true for BRIN, which always reads the whole index. Now, since > we're only concerned with btree at the moment, nothing would be > immediately broken by this approach but it seems we're just setting > ourselves up for future pain. > I think to make it future-proof, we could add a generic page estimation function. However, I have tried to implement based on your below suggestion, see if that looks better to you, otherwise we can introduce a generic page estimation API. > I have what I think might be a better idea: let's make > amcostestimate() responsible for returning a suggested parallel degree > for the path via an additional out parameter. cost_index() can then > reduce that value if it seems like not enough heap pages will be > fetched to justify the return value, or it can override it completely > if parallel_degree is set for the relation. > I see the value of your idea, but I think it might be better to return the number of IndexPages required to be scanned from amcostestimate() and then use the already computed value of heap_pages in cost_index to identify the number of workers. This will make the calculation simple and avoid overriding the return value. Now, the returned value of index pages will only be used for cases when partial path is being constructed, but I think that is okay, because we are not doing any extra calculation to compute the number of index pages fetched. Another argument could be that the number of index pages to be used for parallelism can be different from the number of pages to be scanned or what ever is used in cost computation, but I think that is also easy to change later when we create partial paths for indexes other than btree. I have implemented the above idea in the attached patch (parallel_index_opt_exec_support_v9.patch) > Then we don't need to run > this logic twice to compute the same value, and we don't violate the > AM abstraction layer. > We can avoid computing the same value twice, however, with your suggested approach, we have to do all the additional work for the cases where employing parallel workers is not beneficial, so not sure if there is a net gain. 
> BTW, the changes to add_partial_path() aren't needed, because an > IndexPath only gets reused if you stick a Bitmap{Heap,And,Or}Path on > top of it, and that won't be the case with this or any other pending > patch. If we get the ability to have a Parallel Bitmap Heap Scan that > takes a parallel index scan rather than a standard index scan as > input, then we'll need something like this. But for now it's probably > best to just update the comments and remove the Assert(). > Right, changed as per suggestion. > I think you can also leave out the documentation changes from these > patches. I'll do some general rewriting of the parallel query section > once we know exactly what capabilities we'll have in this release; I > think that will work better than trying to update them a bit at a time > for each patch. > Okay, removed the documentation part. Patches to be used: guc_parallel_index_scan_v1.patch [1], parallel_index_scan_v8.patch, parallel_index_opt_exec_support_v9.patch [1] - https://www.postgresql.org/message-id/CAA4eK1%2BTnM4pXQbvn7OXqam%2Bk_HZqb0ROZUMxOiL6DWJYCyYow%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com parallel_index_scan_v8.patch Description: Binary data parallel_index_opt_exec_support_v9.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpr
Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)
Jim Nasby writes: > On 2/8/17 5:54 PM, Tom Lane wrote: >> Maybe it'd be better to imagine this as something closer to planagg.c, >> that is it knows how to apply a specific high-level optimization that >> might or might not be a win, so it builds a path describing that and sees >> if it looks cheaper than a path done the normal way. The fact that >> we could even build a path describing a union is something that wasn't >> there before 9.6, but maybe there's enough infrastructure for that now. > It's encouraging that there's at least a template to follow there... If > there is missing infrastructure, is it likely to be major? My uninformed > guess would be that the pathification was the major hurdle. I wrote a POC patch for this on a long airplane ride. It's not complete, and I'm sure there are bugs as well, but it makes your example case better. What I did about the de-duplication issue is to de-dup using the CTIDs of all baserels in the query. This means the optimization is only applicable to cases where all the rels have CTIDs ... but other methods such as inspecting unique keys ain't gonna work for non-table rels either, so I think this is about the best we can hope for. However, I did not understand your point about: > BTW, there's an important caveat here: users generally do NOT want > duplicate rows from the fact table if the dimension table results aren't > unique. so maybe we should be thinking about some other way entirely? Anyway, what's attached below produces this on your example query: Aggregate (cost=38.12..38.13 rows=1 width=8) (actual time=0.158..0.158 rows=1 loops=1) -> Result (cost=38.03..38.11 rows=4 width=0) (actual time=0.151..0.155 rows=3 loops=1) -> Unique (cost=38.03..38.07 rows=4 width=18) (actual time=0.150..0.154 rows=3 loops=1) -> Sort (cost=38.03..38.04 rows=4 width=18) (actual time=0.150..0.151 rows=4 loops=1) Sort Key: f.ctid, d1.ctid, d2.ctid Sort Method: quicksort Memory: 25kB -> Append (cost=4.85..37.99 rows=4 width=18) (actual time=0.070..0.106 rows=4 loops=1) -> Nested Loop Left Join (cost=4.85..19.00 rows=2 width=18) (actual time=0.069..0.075 rows=2 loops=1) -> Nested Loop (cost=4.57..18.37 rows=2 width=16) (actual time=0.035..0.038 rows=2 loops=1) -> Index Scan using dim_t_key on dim d1 (cost=0.28..8.29 rows=1 width=10) (actual time=0.009..0.009 rows=1 loops=1) Index Cond: ('1'::text = t) -> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.023..0.025 rows=2 loops=1) Recheck Cond: (f1 = d1.s) Heap Blocks: exact=2 -> Bitmap Index Scan on f_f1 (cost=0.00..4.29 rows=2 width=0) (actual time=0.016..0.016 rows=2 loops=1) Index Cond: (f1 = d1.s) -> Index Scan using dim_pkey on dim d2 (cost=0.28..0.31 rows=1 width=10) (actual time=0.016..0.017 rows=0 loops=2) Index Cond: (f.f2 = s) -> Nested Loop Left Join (cost=4.85..19.00 rows=2 width=18) (actual time=0.025..0.029 rows=2 loops=1) -> Nested Loop (cost=4.57..18.37 rows=2 width=16) (actual time=0.022..0.025 rows=2 loops=1) -> Index Scan using dim_t_key on dim d2 (cost=0.28..8.29 rows=1 width=10) (actual time=0.003..0.003 rows=1 loops=1) Index Cond: ('1'::text = t) -> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.017..0.019 rows=2 loops=1) Recheck Cond: (f2 = d2.s) Heap Blocks: exact=2 -> Bitmap Index Scan on f_f2 (cost=0.00..4.29 rows=2 width=0) (actual time=0.014..0.014 rows=2 loops=1) Index Cond: (f2 = d2.s) -> Index Scan using dim_pkey on dim d1 (cost=0.28..0.31 rows=1 width=10) (actual time=0.001..0.001 rows=0 loops=2) Index Cond: (f.f1 = s) The main 
remaining piece of work here is that, as you can see from the above, it fails to eliminate joins to tables that we don't actually need in a particular UNION arm. This is because the references to those tables' ctid columns prevent analyzejoins.c from removing the joins. I've thought about ways to deal with that but haven't come up with anything that wasn't pretty ugly and/or invasive.
Re: [HACKERS] \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)
On 11 February 2017 at 23:45, Corey Huinker wrote: > So you'd just want to know nesting depth, with no indicator of true/false? Even that's more than bash does, for example: $ if true ; then > if false ; then > : > fi > fi I'm a bit confused how the true/false is actually valuable. It doesn't tell you how the expression actually evaluated, just where you are in the code you're typing in which you can tell equally well by looking at what code you're typing in. The reason nesting level is handy is just to remind you in case you forget. For debugging scripts it would be handy to have some way to tell whether the \if expression actually evaluated to true or false but that wouldn't be in the prompt I don't think. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)
On Sat, Feb 11, 2017 at 5:57 PM, Greg Stark wrote: > On 10 February 2017 at 21:36, Fabien COELHO wrote: > >> command prompt is now > >> --- --- > >> \echo bob '' = initial state, no branch going on at all > >> \if yes 't' = inside a true branch > >> \if no'tf' = false inside a true > >> \endif't' = back to just the true branch > >> \if yes 'tt' > >> \if yes 'ttt' > >> \if yes '...ttt' = only show the last 3, but let it be known that > >> there's at least one more' > >> \else '...ttz' = past the point of a true bit of this branch > > > > > > I like the "tfz" idea. I'm not sure whether the up to 6 characters is a > > good, though. > > > I haven't been following this thread but just skimming through it for > the first time I thought this was more baroque than I was expecting. I > was expecting something like a { for each level of nested if you're in > so you can see how many deep you are. I didn't expect to see anything > more complex than that. > So you'd just want to know nesting depth, with no indicator of true/false?
Re: [HACKERS] Parallel Index Scans
On Sat, Feb 11, 2017 at 9:41 PM, Robert Haas wrote: > On Fri, Feb 10, 2017 at 11:22 PM, Amit Kapila wrote: >>> Why can't we rely on _bt_walk_left? >> >> The reason is mentioned in comments, but let me try to explain with >> some example. When you reach that point of code, it means that either >> the current page (assume page number is 10) doesn't contain any >> matching items or it is a half-dead page, both of which indicates that >> we have to move to the previous page. Now, before checking if the >> current page contains matching items, we signal parallel machinery >> (via _bt_parallel_release) to allow workers to read the previous page >> (assume previous page number is 9). So it is quite possible that >> after deciding that current page (page number 10) doesn't contain any >> matching tuples if we directly move to the previous page (in this case >> it will be 9) by using _bt_walk_left, some other worker would have >> read page 9. In short, if we directly use _bt_walk_left(), then we >> are prone to returning some of the values twice as multiple workers >> can read the same page. > > But ... the entire point of the seize-and-release stuff is to avoid > this problem. You're suppose to seize the scan, read the current > page, walk left, store the page you find in the scan, and then release > the scan. > Exactly and that is what is done in the patch. Basically, if we found that the current page is half-dead or it doesn't contain any matching items, then release the current buffer, seize the scan, read the current page, walk left and so on. I am slightly confused here because it seems both of us agree on what is the right thing to do and according to me that is how it is implemented. Are you just ensuring about whether I have implemented as discussed or do you see a problem with the way it is implemented? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)
On 10 February 2017 at 21:36, Fabien COELHO wrote: >> command prompt is now >> --- --- >> \echo bob '' = initial state, no branch going on at all >> \if yes 't' = inside a true branch >> \if no'tf' = false inside a true >> \endif't' = back to just the true branch >> \if yes 'tt' >> \if yes 'ttt' >> \if yes '...ttt' = only show the last 3, but let it be known that >> there's at least one more' >> \else '...ttz' = past the point of a true bit of this branch > > > I like the "tfz" idea. I'm not sure whether the up to 6 characters is a > good, though. I haven't been following this thread but just skimming through it for the first time I thought this was more baroque than I was expecting. I was expecting something like a { for each level of nested if you're in so you can see how many deep you are. I didn't expect to see anything more complex than that. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands: \quit_if, \quit_unless)
On Sat, Feb 11, 2017 at 3:48 PM, Fabien COELHO wrote: > > Just realized that '?' means "unknown transactional status" in %x. That >> might cause confusion if a person had a prompt of %x%R. Is that enough >> reason to pick a different cue? >> > > Argh. > > "\?\.?[tfz]" seems distinctive enough. Note that %R uses "'=-*^!$( and %x > uses *!?, which means that they already share 2 characters, so adding ? > does not seem like a big issue if it was not one before. > > Otherwise, maybe "&" or "%", but it is less about a condition. Fair enough, it shouldn't be too confusing then. The get_prompt() function can see the global pset, obviously, but can't see the scan_state, where the if-stack currently resides. I could give up on the notion of a per-file if-stack and just have one in pset, but that might make life difficult for whomever is brave enough to tackle \while loops. Or I could give pset a pointer to the current if-stack inside the scan_state, or I could have pset hold a stack of stacks. Unsure which way would be best.
Re: [HACKERS] REINDEX CONCURRENTLY 2.0
On 02/02/2015 03:10 PM, Andres Freund wrote: I think if we should instead just use the new index, repoint the dependencies onto the new oid, and then afterwards, when dropping, rename the new index one onto the old one. That means the oid of the index will change and some less than pretty grovelling around dependencies, but it still seems preferrable to what we're discussing here otherwise. I think that sounds like a good plan. The oid change does not seem like a too big deal to me, especially since that is what users will get now too. Do you still think this is the right way to solve this? I have attached my work in progress patch which implements and is very heavily based on Michael's previous work. There are some things left to do but I think I should have a patch ready for the next commitfest if people still like this type of solution. I also changed index_set_state_flags() to be transactional since I wanted the old index to become invalid at exactly the same time as the new becomes valid. From reviewing the code that seems like a safe change. A couple of bike shedding questions: - Is the syntax "REINDEX CONCUURENTLY " ok? - What should we do with REINDEX DATABASE CONCURRENTLY and the system catalog? I so not think we can reindex the system catalog concurrently safely, so what should REINDEX DATABASE do with the catalog indexes? Skip them, reindex them while taking locks, or just error out? - What level of information should be output in VERBOSE mode? What remains to be implemented: - Support for exclusion constraints - Look more into how I handle constraints (currently the temporary index too seems to have the PRIMARY KEY flag) - Support for the VERBOSE flag - More testing to catch bugs Andreas diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml index 306def4a15..ca1aeca65f 100644 --- a/doc/src/sgml/mvcc.sgml +++ b/doc/src/sgml/mvcc.sgml @@ -923,7 +923,8 @@ ERROR: could not serialize access due to read/write dependencies among transact Acquired by VACUUM (without FULL), - ANALYZE, CREATE INDEX CONCURRENTLY, and + ANALYZE, CREATE INDEX CONCURRENTLY, + REINDEX CONCURRENTLY, ALTER TABLE VALIDATE and other ALTER TABLE variants (for full details see ). diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 3908ade37b..24464020cd 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -21,7 +21,7 @@ PostgreSQL documentation -REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } name +REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name @@ -68,9 +68,12 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } CONCURRENTLY option failed, leaving an invalid index. Such indexes are useless but it can be convenient to use REINDEX to rebuild them. Note that - REINDEX will not perform a concurrent build. To build the - index without interfering with production you should drop the index and - reissue the CREATE INDEX CONCURRENTLY command. + REINDEX will perform a concurrent build if + CONCURRENTLY is specified. To build the index without interfering + with production you should drop the index and reissue either the + CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY + command. Indexes of toast relations can be rebuilt with REINDEX + CONCURRENTLY. 
@@ -152,6 +155,21 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } +CONCURRENTLY + + + When this option is used, PostgreSQL will rebuild the + index without taking any locks that prevent concurrent inserts, + updates, or deletes on the table; whereas a standard reindex build + locks out writes (but not reads) on the table until it's done. + There are several caveats to be aware of when using this option + — see . + + + + + VERBOSE @@ -231,6 +249,172 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } + + Rebuilding Indexes Concurrently + + + index + rebuilding concurrently + + + +Rebuilding an index can interfere with regular operation of a database. +Normally PostgreSQL locks the table whose index is rebuilt +against writes and performs the entire index build with a single scan of the +table. Other transactions can still read the table, but if they try to +insert, update, or delete rows in the table they will block until the +index rebuild is finished. This could have a severe effect if the system is +a live production database. Very large tables can take many hours to be +indexed, and even for smaller tables, an index rebuild can lock out writers +for periods that are unacceptably long for a production system. + + + +PostgreSQL supports rebuilding index
Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands: \quit_if, \quit_unless)
Just realized that '?' means "unknown transactional status" in %x. That might cause confusion if a person had a prompt of %x%R. Is that enough reason to pick a different cue? Argh. "\?\.?[tfz]" seems distinctive enough. Note that %R uses "'=-*^!$( and %x uses *!?, which means that they already share 2 characters, so adding ? does not seem like a big issue if it was not one before. Otherwise, maybe "&" or "%", but it is less about a condition. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands: \quit_if, \quit_unless)
On Sat, Feb 11, 2017 at 2:43 AM, Fabien COELHO wrote: > > Ok, so that's not just PROMPT_READY, that's every prompt...which might be >> ok. ? is a great optional cue, and you're thinking on 2 levels deep, 2nd >> level always being '.'? >> > > Yep. The idea is to keep it short, but to still have something to say > "there are more levels" in the stack, hence the one dot. Basically I just > compressed your 4 level proposal, and added a separator to deal with the > preceding stuff and suggest the conditional. > > -- > Fabien. > Just realized that '?' means "unknown transactional status" in %x. That might cause confusion if a person had a prompt of %x%R. Is that enough reason to pick a different cue?
Re: [HACKERS] Access inside pg_node_tree from query?
> > There are no in-core operators or functions to manipulate pg_node_tree. > Thanks Michael, just checking!
Re: [HACKERS] Should we cacheline align PGXACT?
On Sat, Feb 11, 2017 at 4:17 PM, Tomas Vondra wrote: > On 02/11/2017 01:21 PM, Alexander Korotkov wrote: > >> Hi, Tomas! >> >> On Sat, Feb 11, 2017 at 2:28 AM, Tomas Vondra >> mailto:tomas.von...@2ndquadrant.com>> >> wrote: >> >> As discussed at the Developer meeting ~ a week ago, I've ran a >> number of benchmarks on the commit, on a small/medium-size x86 >> machines. I currently don't have access to a machine as big as used >> by Alexander (with 72 physical cores), but it seems useful to verify >> the patch does not have negative impact on smaller machines. >> >> In particular I've ran these tests: >> >> * r/o pgbench >> * r/w pgbench >> * 90% reads, 10% writes >> * pgbench with skewed distribution >> * pgbench with skewed distribution and skipping >> >> >> Thank you very much for your efforts! >> I took a look at these tests. One thing catch my eyes. You warmup >> database using pgbench run. Did you consider using pg_prewarm instead? >> >> SELECT sum(x.x) FROM (SELECT pg_prewarm(oid) AS x FROM pg_class WHERE >> relkind IN ('i', 'r') ORDER BY oid) x; >> >> In my experience pg_prewarm both takes less time and leaves less >> variation afterwards. >> >> > I've considered it, but the problem I see in using pg_prewarm for > benchmarking purposes is that it only loads the data into memory, but it > does not modify the tuples (so all tuples have the same xmin/xmax, no dead > tuples, ...), it does not set usage counters on the buffers and also does > not generate any clog records. > Yes, but please note that pgbench runs VACUUM first, and all the tuples would be hinted. In order to xmin/xmax and clog take effect, you should run subsequent pgbench with --no-vacuum. Also usage counters of buffers make sense only when eviction happens, i.e. data don't fit shared_buffers. Also, you use pgbench -S to warmup before readonly test. I think pg_prewarm would be much better there unless your data is bigger than shared_buffers. I don't think there's a lot of variability in the results I measured. If > you look at (max-min) for each combination of parameters, the delta is > generally within 2% of average, with a very few exceptions, usually caused > by the first run (so perhaps the warmup should be a bit longer). You also could run pg_prewarm before warmup pgbench for readwrite test. In my intuition you should get more stable results with shorter warmup. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] [ patch ] pg_dump: new --custom-fetch-table and --custom-fetch-value parameters
I'm a beginner here... anyway I try to share my ideas. My situation is changed in a worst state: I'm no more able to make a pg_dump neither with my custom fetch value (I have tried "1" as value = one row at the time) neither without the "--column-inserts": pg_dump: Dumping the contents of table "tDocumentsFiles" failed: PQgetResult() failed. pg_dump: Error message from server: ERROR: out of memory DETAIL: Failed on request of size 1073741823. pg_dump: The command was: COPY public."tDocumentsFiles" ("ID_Document", "ID_File", "Name", "FileName", "Link", "Note", "Picture", "Content", "FileSize", "FileDateTime", "DrugBox", "DrugPicture", "DrugInstructions") TO stdout; I don't know if the Kyotaro Horiguchi patch will solve this, because, again, I'm not able to get neither one single row. Similar problem trying to read and to write the bloab fields with my program. Actually I'm working via pieces: Read r1) I get the length of the bloab field r2) I check the available free memory (on the client pc) r3) I read pieces of the bloab field, according to the free memory, appending them to a physical file Write w1) I check the length of the file to save inside the bloab w2) I check the available free memory (on the client pc) w3) I create a temporary table on the server w4) I add lines to this temporary table, writing pieces of the file according to the free memory w5) I ask the server to write, inside the final bloab field, the concatenation of the rows of the temporary data The read and write is working now. Probably the free memory check should be done on both sides (client and server [does a function/view with the available free memory exist?]) taking the smallest one. What do you think to use a similar approach in the pg_dump? a) go through the table getting the size of each row / fields b) when the size of the row or of the field is bigger than the value (provided or stored somewhere), read pieces of the field till the end PS: I have see there are the "large object" that can work via streams. My files are actually not bigger than 1Gb, but, ok, maybe in the future I will use them instead of the bloabs. Thank you Andrea -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Index Scans
On Fri, Feb 10, 2017 at 11:22 PM, Amit Kapila wrote: >> Why can't we rely on _bt_walk_left? > > The reason is mentioned in comments, but let me try to explain with > some example. When you reach that point of code, it means that either > the current page (assume page number is 10) doesn't contain any > matching items or it is a half-dead page, both of which indicates that > we have to move to the previous page. Now, before checking if the > current page contains matching items, we signal parallel machinery > (via _bt_parallel_release) to allow workers to read the previous page > (assume previous page number is 9). So it is quite possible that > after deciding that current page (page number 10) doesn't contain any > matching tuples if we directly move to the previous page (in this case > it will be 9) by using _bt_walk_left, some other worker would have > read page 9. In short, if we directly use _bt_walk_left(), then we > are prone to returning some of the values twice as multiple workers > can read the same page. But ... the entire point of the seize-and-release stuff is to avoid this problem. You're suppose to seize the scan, read the current page, walk left, store the page you find in the scan, and then release the scan. The entire point of that stuff is that when somebody's advancing the scan to the next page, everybody else waits for them to get done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums by default?
On Fri, Feb 10, 2017 at 7:38 PM, Tomas Vondra wrote: > Incidentally, I've been dealing with a checksum failure reported by a > customer last week, and based on the experience I tend to agree that we > don't have the tools needed to deal with checksum failures. I think such > tooling should be a 'must have' for enabling checksums by default. > > In this particular case the checksum failure is particularly annoying > because it happens during recovery (on a standby, after a restart), during > startup, so FATAL means shutdown. > > I've managed to inspect the page in different way (dd and pageinspect from > another instance), and it looks fine - no obvious data corruption, the only > thing that seems borked is the checksum itself, and only three consecutive > bits are flipped in the checksum. So this doesn't seem like a "stale > checksum" - hardware issue is a possibility (the machine has ECC RAM > though), but it might just as easily be a bug in PostgreSQL, when something > scribbles over the checksum due to a buffer overflow, just before we write > the buffer to the OS. So 'false failures' are not entirely impossible thing. > > And no, backups may not be a suitable solution - the failure happens on a > standby, and the page (luckily) is not corrupted on the master. Which means > that perhaps the standby got corrupted by a WAL, which would affect the > backups too. I can't verify this, though, because the WAL got removed from > the archive, already. But it's a possibility. > > So I think we're not ready to enable checksums by default for everyone, not > until we can provide tools to deal with failures like this (I don't think > users will be amused if we tell them to use 'dd' and inspect the pages in a > hex editor). > > ISTM the way forward is to keep the current default (disabled), but to allow > enabling checksums on the fly. That will mostly fix the issue for people who > actually want checksums but don't realize they need to enable them at initdb > time (and starting from scratch is not an option for them), are running on > good hardware and are capable of dealing with checksum errors if needed, > even without more built-in tooling. > > Being able to disable checksums on the fly is nice, but it only really > solves the issue of extra overhead - it does really help with the failures > (particularly when you can't even start the database, because of a checksum > failure in the startup phase). > > So, shall we discuss what tooling would be useful / desirable? FWIW, I appreciate this analysis and I think it's exactly the kind of thing we need to set a strategy for moving forward. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should we cacheline align PGXACT?
> FWIW it might be interesting to have comparable results from the same > benchmarks I did. The scripts available in the git repositories should not > be that hard to tweak. Let me know if you're interested and need help with > that. > Sure, I will have a look into those scripts once I am done with the simple pgbench testing. Thanks and sorry for the delayed response. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: two slab-like memory allocators
On 02/11/2017 02:33 AM, Andres Freund wrote: On 2017-02-11 02:13:59 +0100, Tomas Vondra wrote: On 02/09/2017 10:37 PM, Andres Freund wrote: Hi, On 2016-12-13 01:45:13 +0100, Tomas Vondra wrote: src/backend/utils/mmgr/Makefile | 2 +- src/backend/utils/mmgr/aset.c | 115 + src/backend/utils/mmgr/memdebug.c | 131 ++ src/include/utils/memdebug.h | 22 +++ 4 files changed, 156 insertions(+), 114 deletions(-) create mode 100644 src/backend/utils/mmgr/memdebug.c I'm a bit loathe to move these to a .c file - won't this likely make these debugging tools even slower? Seems better to put some of them them in a header as static inlines (not randomize, but the rest). Do you have any numbers to support that? AFAICS compilers got really good in inlining stuff on their own. Unless you use LTO, they can't inline across translation units. And using LTO is slow enough for linking that it's not that much fun to use, as it makes compile-edit-compile cycles essentially take as long as a full rebuild. From 43aaabf70b979b172fd659ef4d0ef129fd78d72d Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Wed, 30 Nov 2016 15:36:23 +0100 Subject: [PATCH 2/3] slab allocator --- src/backend/replication/logical/reorderbuffer.c | 82 +-- src/backend/utils/mmgr/Makefile | 2 +- src/backend/utils/mmgr/slab.c | 803 src/include/nodes/memnodes.h| 2 +- src/include/nodes/nodes.h | 1 + src/include/replication/reorderbuffer.h | 15 +- src/include/utils/memutils.h| 9 + I'd like to see the reorderbuffer changes split into a separate commit from the slab allocator introduction. I rather dislike patches that only add a bunch of code, without actually using it anywhere. But if needed, this is trivial to do at commit time - just don't commit the reorderbuffer bits. Meh. + * Each block includes a simple bitmap tracking which chunks are used/free. + * This makes it trivial to check if all chunks on the block are free, and + * eventually free the whole block (which is almost impossible with a global + * freelist of chunks, storing chunks from all blocks). Why is checking a potentially somewhat long-ish bitmap better than a simple counter, or a "linked list" of "next free chunk-number" or such (where free chunks simply contain the id of the subsequent chunk)? Using a list instead of a bitmap would also make it possible to get 'lifo' behaviour, which is good for cache efficiency. A simple chunk-number based singly linked list would only imply a minimum allocation size of 4 - that seems perfectly reasonable? A block-level counter would be enough to decide if all chunks on the block are free, but it's not sufficient to identify which chunks are free / available for reuse. The bitmap only has a single bit per chunk, so I find "potentially long-ish" is a bit misleading. Any linked list implementation will require much more per-chunk overhead - as the chunks are fixed-legth, it's possible to use chunk index (instead of 64-bit pointers), to save some space. But with large blocks / small chunks that's still at least 2 or 4 bytes per index, and you'll need two (to implement doubly-linked list, to make add/remove efficient). For example assume 8kB block and 64B chunks, i.e. 128 chunks. With bitmap that's 16B to track all free space on the block. Doubly linked list would require 1B per chunk index, 2 indexes per chunk. That's 128*2 = 256B. I have a hard time believing this the cache efficiency of linked lists (which may or may not be real in this case) out-weights this, but if you want to try, be my guest. 
I'm not following - why would there be overhead in anything for allocations bigger than 4 (or maybe 8) bytes? You can store the list (via chunk ids, not pointers) inside the chunks itself, where data otherwise would be. And I don't see why you'd need a doubly linked list, as the only operations that are needed are to push to the front of the list, and to pop from the front of the list - and both operations are simple to do with a singly linked list? Oh! I have not considered storing the chunk indexes (for linked lists) in the chunk itself, which obviously eliminates the overhead concerns, and you're right a singly-linked list should be enough. There will be some minimum-chunk-size requirement, depending on the block size/chunk size. I wonder whether it makes sense to try to be smart and make it dynamic, so that we only require 1B or 2B for cases when only that many chunks fit into a block, or just say that it's 4B and be done with it. I mean 2^32 chunks ought to be enough for anyone, right? I'm still not buying the cache efficiency argument, though. One of the reasons is that the implementation prefers blocks with fewer free chunks when handling palloc(), so pfre
Re: [HACKERS] LWLock optimization for multicore Power machines
On 02/11/2017 01:42 PM, Alexander Korotkov wrote: I think it would make sense to run more kinds of tests. Could you try set of tests provided by Tomas Vondra? If even we wouldn't see win some of the tests, it would be still valuable to see that there is no regression there. FWIW it shouldn't be difficult to tweak my scripts and run them on another machine. You'd have to customize the parameters (scales, client counts, ...) and there are a few hard-coded paths, but that's about it. regards Tomas -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should we cacheline align PGXACT?
On 02/11/2017 01:21 PM, Alexander Korotkov wrote: Hi, Tomas! On Sat, Feb 11, 2017 at 2:28 AM, Tomas Vondra mailto:tomas.von...@2ndquadrant.com>> wrote: As discussed at the Developer meeting ~ a week ago, I've ran a number of benchmarks on the commit, on a small/medium-size x86 machines. I currently don't have access to a machine as big as used by Alexander (with 72 physical cores), but it seems useful to verify the patch does not have negative impact on smaller machines. In particular I've ran these tests: * r/o pgbench * r/w pgbench * 90% reads, 10% writes * pgbench with skewed distribution * pgbench with skewed distribution and skipping Thank you very much for your efforts! I took a look at these tests. One thing catch my eyes. You warmup database using pgbench run. Did you consider using pg_prewarm instead? SELECT sum(x.x) FROM (SELECT pg_prewarm(oid) AS x FROM pg_class WHERE relkind IN ('i', 'r') ORDER BY oid) x; In my experience pg_prewarm both takes less time and leaves less variation afterwards. I've considered it, but the problem I see in using pg_prewarm for benchmarking purposes is that it only loads the data into memory, but it does not modify the tuples (so all tuples have the same xmin/xmax, no dead tuples, ...), it does not set usage counters on the buffers and also does not generate any clog records. I don't think there's a lot of variability in the results I measured. If you look at (max-min) for each combination of parameters, the delta is generally within 2% of average, with a very few exceptions, usually caused by the first run (so perhaps the warmup should be a bit longer). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LWLock optimization for multicore Power machines
On Wed, Feb 8, 2017 at 5:00 PM, Bernd Helmle wrote: > Am Dienstag, den 07.02.2017, 16:48 +0300 schrieb Alexander Korotkov: > > But win isn't > > as high as I observed earlier. And I wonder why absolute numbers are > > lower > > than in our earlier experiments. We used IBM E880 which is actually > > two > > Did you run your tests on bare metal or were they also virtualized? > I run tests on bare metal. > nodes with interconnect. However interconnect is not fast enough to > > make > > one PostgreSQL instance work on both nodes. Thus, used half of IBM > > E880 > > which has 4 sockets and 32 physical cores. While you use IBM E850 > > which is > > really single node with 4 sockets and 48 physical cores. Thus, it > > seems > > that you have lower absolute numbers on more powerful hardware. That > > makes > > me uneasy and I think we probably don't get the best from this > > hardware. > > Just in case, do you use SMT=8? > > Yes, SMT=8 was used. > > The machine has 4 sockets, 8 Core each, 3.7 GHz clock frequency. The > Ubuntu LPAR running on PowerVM isn't using all physical cores, > currently 28 cores are assigned (=224 SMT Threads). The other cores are > dedicated to the PowerVM hypervisor and a (very) small AIX LPAR. > Thank you very much for the explanation. Thus, I see reasons why in your tests absolute results are lower than in my previous tests. 1. You use 28 physical cores while I was using 32 physical cores. 2. You run tests in PowerVM while I was running test on bare metal. PowerVM could have some overhead. 3. I guess you run pgbench on the same machine. While in my tests pgbench was running on another node of IBM E880. Therefore, having lower absolute numbers in your tests, win of LWLock optimization is also lower. That is understandable. But win of LWLock optimization is clearly visible definitely exceeds variation. I think it would make sense to run more kinds of tests. Could you try set of tests provided by Tomas Vondra? If even we wouldn't see win some of the tests, it would be still valuable to see that there is no regression there. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Access inside pg_node_tree from query?
On Sat, Feb 11, 2017 at 7:19 PM, Ryan Murphy wrote: > Quick question, just curious - is there a way to access the members of a > `pg_node_tree` value within a Postgres query? By pg_node_tree I mean for > example, the `ev_qual` field in the `pg_rewrite` table. By "access the > members" I mean in the same way that you can access the members of a json or > jsonp value (using -> or ->>). Is there anything analogous for > pg_node_tree? There are no in-core operators or functions to manipulate pg_node_tree. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should we cacheline align PGXACT?
Hi, Tomas! On Sat, Feb 11, 2017 at 2:28 AM, Tomas Vondra wrote: > As discussed at the Developer meeting ~ a week ago, I've ran a number of > benchmarks on the commit, on a small/medium-size x86 machines. I currently > don't have access to a machine as big as used by Alexander (with 72 > physical cores), but it seems useful to verify the patch does not have > negative impact on smaller machines. > > In particular I've ran these tests: > > * r/o pgbench > * r/w pgbench > * 90% reads, 10% writes > * pgbench with skewed distribution > * pgbench with skewed distribution and skipping > Thank you very much for your efforts! I took a look at these tests. One thing catch my eyes. You warmup database using pgbench run. Did you consider using pg_prewarm instead? SELECT sum(x.x) FROM (SELECT pg_prewarm(oid) AS x FROM pg_class WHERE relkind IN ('i', 'r') ORDER BY oid) x; In my experience pg_prewarm both takes less time and leaves less variation afterwards. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] DROP SUBSCRIPTION and ROLLBACK
On 10/02/17 19:55, Masahiko Sawada wrote: > On Thu, Feb 9, 2017 at 12:44 AM, Petr Jelinek > wrote: >> On 08/02/17 07:40, Masahiko Sawada wrote: >>> On Wed, Feb 8, 2017 at 9:01 AM, Michael Paquier >>> wrote: On Wed, Feb 8, 2017 at 1:30 AM, Fujii Masao wrote: > On Wed, Feb 8, 2017 at 12:26 AM, Petr Jelinek > wrote: >> For example what happens if apply crashes during the DROP >> SUBSCRIPTION/COMMIT and is not started because the delete from catalog >> is now visible so the subscription is no longer there? > > Another idea is to treat DROP SUBSCRIPTION in the same way as VACUUM, > i.e., > make it emit an error if it's executed within user's transaction block. It seems to me that this is exactly Petr's point: using PreventTransactionChain() to prevent things to happen. >>> >>> Agreed. It's better to prevent to be executed inside user transaction >>> block. And I understood there is too many failure scenarios we need to >>> handle. >>> > Also DROP SUBSCRIPTION should call CommitTransactionCommand() just > after removing the entry from pg_subscription, then connect to the > publisher > and remove the replication slot. For consistency that may be important. >>> >>> Agreed. >>> >>> Attached patch, please give me feedback. >>> >> >> This looks good (and similar to what initial patch had btw). Works fine >> for me as well. >> >> Remaining issue is, what to do about CREATE SUBSCRIPTION then, there are >> similar failure scenarios there, should we prevent it from running >> inside transaction as well? >> > > Hmm, after thought I suspect current discussing approach. For > example, please image the case where CRAETE SUBSCRIPTION creates > subscription successfully but fails to create replication slot for > whatever reason, and then DROP SUBSCRIPTION drops the subscription but > dropping replication slot is failed. In such case, CREAET SUBSCRIPTION > and DROP SUBSCRIPTION return ERROR but the subscription is created and > dropped successfully. I think that this behaviour confuse the user. > > I think we should just prevent calling DROP SUBSCRIPTION in user's > transaction block. Or I guess that it could be better to separate the > starting/stopping logical replication from subscription management. > We need to stop the replication worker(s) in order to be able to drop the slot. There is no such issue with startup of the worker as that one is launched by launcher after the transaction has committed. IMO best option is to just don't allow DROP/CREATE SUBSCRIPTION inside a transaction block and don't do any commits inside of those (so that there are no rollbacks, which solves your initial issue I believe). That way failure to create/drop slot will result in subscription not being created/dropped which is what we want. Note that we do give users options to not create and not drop slots if they wish so we should really treat slot related failures as command errors. I don't want subscription setup to be 20 step where you have to deal with various things on multiple servers when it can be just plain simple one command on each side in many cases. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint
Hi, Am Samstag, den 11.02.2017, 11:25 +0100 schrieb Michael Banck: > Am Samstag, den 11.02.2017, 11:07 +0100 schrieb Magnus Hagander: > > As for the code, while I haven't tested it, isn't the "checkpoint > > completed" message in the wrong place? Doesn't PQsendQuery() complete > > immediately, and the check needs to be put *after* the PQgetResult() > > call? > > I guess you're right, I've moved it further down. There is in fact a > message about the xlog location (unless you switch off wal entirely), > but having another one right before that mentioning the completed > checkpoint sounds ok to me. > > There's also some inconsistencies around which messages are prepended > with "pg_basebackup: " and which are translatable; I guess all messages > printed on --verbose should be translatable? Also, as almost all > messages have a "pg_basebackup: " prefix, I've added it to the rest. Sorry, there were two typoes in the last patch, I've attached a fixed one. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index c9dd62c..a298e5c 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -660,6 +660,14 @@ PostgreSQL documentation Notes + At the beginning of the backup, a checkpoint needs to be written on the + server the backup is taken from. Especially if the option + --checkpoint=fast is not used, this can take some time + during which pg_basebackup will be idle on the + server it is running on. + + + The backup will include all files in the data directory and tablespaces, including the configuration files and any additional files placed in the directory by third parties, except certain temporary files managed by diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c index b6463fa..60200a9 100644 --- a/src/bin/pg_basebackup/pg_basebackup.c +++ b/src/bin/pg_basebackup/pg_basebackup.c @@ -1754,6 +1754,11 @@ BaseBackup(void) if (maxrate > 0) maxrate_clause = psprintf("MAX_RATE %u", maxrate); + if (verbose) + fprintf(stderr, +_("%s: initiating base backup, waiting for checkpoint to complete\n"), +progname); + basebkp = psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s", escaped_label, @@ -1791,6 +1796,9 @@ BaseBackup(void) strlcpy(xlogstart, PQgetvalue(res, 0, 0), sizeof(xlogstart)); + if (verbose) + fprintf(stderr, _("%s: checkpoint completed\n"), progname); + /* * 9.3 and later sends the TLI of the starting point. 
With older servers, * assume it's the same as the latest timeline reported by @@ -1804,8 +1812,8 @@ BaseBackup(void) MemSet(xlogend, 0, sizeof(xlogend)); if (verbose && includewal != NO_WAL) - fprintf(stderr, _("transaction log start point: %s on timeline %u\n"), -xlogstart, starttli); + fprintf(stderr, _("%s: transaction log start point: %s on timeline %u\n"), +progname, xlogstart, starttli); /* * Get the header @@ -1907,7 +1915,7 @@ BaseBackup(void) } strlcpy(xlogend, PQgetvalue(res, 0, 0), sizeof(xlogend)); if (verbose && includewal != NO_WAL) - fprintf(stderr, "transaction log end point: %s\n", xlogend); + fprintf(stderr, _("%s: transaction log end point: %s\n"), progname, xlogend); PQclear(res); res = PQgetResult(conn); @@ -2048,7 +2056,7 @@ BaseBackup(void) } if (verbose) - fprintf(stderr, "%s: base backup completed\n", progname); + fprintf(stderr, _("%s: base backup completed\n"), progname); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint
Hi, Am Samstag, den 11.02.2017, 11:07 +0100 schrieb Magnus Hagander: > As for the code, while I haven't tested it, isn't the "checkpoint > completed" message in the wrong place? Doesn't PQsendQuery() complete > immediately, and the check needs to be put *after* the PQgetResult() > call? I guess you're right, I've moved it further down. There is in fact a message about the xlog location (unless you switch off wal entirely), but having another one right before that mentioning the completed checkpoint sounds ok to me. There's also some inconsistencies around which messages are prepended with "pg_basebackup: " and which are translatable; I guess all messages printed on --verbose should be translatable? Also, as almost all messages have a "pg_basebackup: " prefix, I've added it to the rest. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index c9dd62c..a298e5c 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -660,6 +660,14 @@ PostgreSQL documentation Notes + At the beginning of the backup, a checkpoint needs to be written on the + server the backup is taken from. Especially if the option + --checkpoint=fast is not used, this can take some time + during which pg_basebackup will be idle on the + server it is running on. + + + The backup will include all files in the data directory and tablespaces, including the configuration files and any additional files placed in the directory by third parties, except certain temporary files managed by diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c index b6463fa..874b6d6 100644 --- a/src/bin/pg_basebackup/pg_basebackup.c +++ b/src/bin/pg_basebackup/pg_basebackup.c @@ -1754,6 +1754,11 @@ BaseBackup(void) if (maxrate > 0) maxrate_clause = psprintf("MAX_RATE %u", maxrate); + if (verbose) + fprintf(stderr, +_("%s: initiating base backup, waiting for checkpoint to complete\n"), +progname); + basebkp = psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s", escaped_label, @@ -1791,6 +1796,9 @@ BaseBackup(void) strlcpy(xlogstart, PQgetvalue(res, 0, 0), sizeof(xlogstart)); + if (verbose) + fprintf(stderr, _("%s: checkpoint completed\n"), progname); + /* * 9.3 and later sends the TLI of the starting point. 
With older servers, * assume it's the same as the latest timeline reported by @@ -1804,8 +1812,8 @@ BaseBackup(void) MemSet(xlogend, 0, sizeof(xlogend)); if (verbose && includewal != NO_WAL) - fprintf(stderr, _("transaction log start point: %s on timeline %u\n"), -xlogstart, starttli); + fprintf(stderr, _("%s: transaction log start point: %s on timeline %u\n"), +progname, xlogstart, starttli); /* * Get the header @@ -1907,7 +1915,7 @@ BaseBackup(void) } strlcpy(xlogend, PQgetvalue(res, 0, 0), sizeof(xlogend)); if (verbose && includewal != NO_WAL) - fprintf(stderr, "transaction log end point: %s\n", xlogend); + fprintf(stderr, _("%s: transaction log end point: %s\n", progname, xlogend); PQclear(res); res = PQgetResult(conn); @@ -2048,7 +2056,7 @@ BaseBackup(void) } if (verbose) - fprintf(stderr, "%s: base backup completed\n", progname); + fprintf(stderr, _("%s: base backup completed\n)", progname); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Access inside pg_node_tree from query?
Hi Postgressers,

Quick question, just curious - is there a way to access the members of a `pg_node_tree` value within a Postgres query? By pg_node_tree I mean, for example, the `ev_qual` field in the `pg_rewrite` catalog. By "access the members" I mean in the same way that you can access the members of a json or jsonb value (using -> or ->>). Is there anything analogous for pg_node_tree?

Thanks!

Ryan
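A quick, untested sketch of the contrast being drawn here: jsonb members are directly addressable with operators, whereas pg_node_tree has no such accessor operators. The nearest built-in facility is pg_get_expr(), which deparses a whole expression tree (one attached to a relation) back to SQL text rather than navigating into it; behavior can vary across versions.

    -- jsonb: members are directly addressable
    SELECT '{"a": {"b": 1}}'::jsonb -> 'a' ->> 'b';

    -- pg_node_tree: no ->-style access, but a node tree belonging to a
    -- relation can be deparsed as a whole, e.g. rule qualifications:
    SELECT r.rulename,
           pg_get_expr(r.ev_qual, r.ev_class) AS rule_qual
    FROM   pg_rewrite AS r
    WHERE  r.ev_qual IS NOT NULL;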
Re: [HACKERS] Logical replication existing data copy
On 2017-02-08 23:25, Petr Jelinek wrote:
> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
> 0001-Logical-replication-support-for-initial-data-copy-v4.patch

Apart from the one failing 'make check' test (test 'object_address'), which I reported earlier, I find it is easy to 'confuse' the replication.

I attach a script that is intended to test the default COPY DATA path. There are two instances, initially without any replication. The script inits pgbench on the master, adds a serial column to pgbench_history, and dump-restores the 4 pgbench tables to the future replica. It then empties the 4 pgbench tables on the 'replica'. The idea is that when logical replication is initiated, data will be replicated from the master, the end result being 4 identical tables on master and replica.

This often works, but it also fails far too often (in my hands). I test whether the tables are identical by comparing an md5 of an ordered result set, from both replica and master. I estimate that 1 in 5 tries fails, 'fail' meaning a somewhat different table on the replica (compared to the master), most often pgbench_accounts (typically there are 10-30 differing rows). No errors or warnings appear in either logfile. I'm not sure, but testing on faster machines seems to do somewhat better ('better' meaning fewer replication errors).

Another, probably unrelated, problem occurs (but much more rarely) when executing 'DROP SUBSCRIPTION sub1' on the replica (see the beginning of the script). Sometimes that command hangs and refuses to accept a shutdown of the server. I don't know how to recover from this -- I just have to kill the replica server (the master server still obeys normal shutdown) and restart the instances.

The script accepts 2 parameters, scale and clients (used in pgbench -s resp. -c). I don't think I've managed to successfully run the script with more than 1 client yet.

Can you have a look whether this is reproducible elsewhere? (A sketch of the publication/subscription commands under test follows the script below.)

thanks,

Erik Rijkers

#!/bin/bash
# (bash rather than sh: arrays and [[ ]] are used below)

# assumes both instances are running, on port 6972 and 6973
logfile1=$HOME/pg_stuff/pg_installations/pgsql.logical_replication/logfile.logical_replication
logfile2=$HOME/pg_stuff/pg_installations/pgsql.logical_replication2/logfile.logical_replication2

scale=1
if [[ ! "$1" == "" ]]
then
  scale=$1
fi

clients=1
if [[ ! "$2" == "" ]]
then
  clients=$2
fi

unset PGSERVICEFILE PGSERVICE PGPORT PGDATA PGHOST
PGDATABASE=testdb

# (this script also uses a custom pgpassfile)

## just for info:
# env | grep PG
# psql -qtAXc "select current_setting('server_version')"

port1=6972
port2=6973

function cb()
{
  # display the 4 pgbench tables' accumulated content as md5s
  # a,b,t,h stand for: pgbench_accounts, -branches, -tellers, -history
  md5_total_6972='-1'
  md5_total_6973='-2'
  for port in $port1 $port2
  do
    md5_a=$(echo "select * from pgbench_accounts order by aid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_b=$(echo "select * from pgbench_branches order by bid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_t=$(echo "select * from pgbench_tellers  order by tid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_h=$(echo "select * from pgbench_history  order by hid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    cnt_a=$(echo "select count(*) from pgbench_accounts" | psql -qtAXp $port)
    cnt_b=$(echo "select count(*) from pgbench_branches" | psql -qtAXp $port)
    cnt_t=$(echo "select count(*) from pgbench_tellers"  | psql -qtAXp $port)
    cnt_h=$(echo "select count(*) from pgbench_history"  | psql -qtAXp $port)
    md5_total[$port]=$(echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" | md5sum)
    printf "$port a,b,t,h: %6d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
    echo -n " $md5_a $md5_b $md5_t $md5_h"
    if [[ $port -eq $port1 ]]; then
      echo " master"
    elif [[ $port -eq $port2 ]]; then
      echo -n " replica"
    else
      echo " ERROR"
    fi
  done
  if [[ "${md5_total[6972]}" == "${md5_total[6973]}" ]]
  then
    echo " ok"
  else
    echo " NOK"
  fi
}

bail=0

pub_count=$(echo "select count(*) from pg_publication" | psql -qtAXp 6972)
if [[ $pub_count -ne 0 ]]
then
  echo "pub_count -ne 0 - deleting pub1 & bailing out"
  echo "drop publication if exists pub1" | psql -Xp 6972
  bail=1
fi

sub_count=$(echo "select count(*) from pg_subscription" | psql -qtAXp 6973)
if [[ $sub_count -ne 0 ]]
then
  echo "sub_count -ne 0 - deleting sub1 & bailing out"
  echo "drop subscription if exists sub1" | psql -Xp 6973
  rc=$?
  echo "(drop subscr.) rc=$rc"
  bail=1
fi

pub_count=$(echo "select count(*) from pg_publication" | psql -qtAXp 6972)
sub_count=$(echo "select count(*) from pg_subscription" | psql -qtAXp 6973)

#if [[ $bail -eq 1 ]]
#then
#if [[ $pub_count -eq 0 ]] && [[ $sub_c
Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint
On Sat, Feb 11, 2017 at 10:38 AM, Michael Banck wrote:
> Hi,
>
> one take-away from the Gitlab post-mortem[1] appears to be that after
> their secondary lost replication, they were confused about what
> pg_basebackup was doing when they tried to rebuild it. It just sat there
> and did nothing (even with --verbose), so they assumed something was
> wrong with either the primary or the connection, and restarted it
> several times.
>
> AFAICT, it turns out the checkpoint was being written on the master
> (they probably did not use -c fast), but this wasn't obvious to them:

Yeah, I've seen this happen to a number of people, and I think that sounds like what happened here as well. I've considered things along the lines of the patch you posted, but never got around to actually doing anything about it.

> ISTM that even with WAL streaming, nothing would be written on the
> client side until the checkpoint is complete, as do_pg_start_backup()
> runs the checkpoint and only returns the starting WAL location
> afterwards.
>
> The attached (untested) patch is meant to kick off a discussion on how
> to improve the situation; it mentions the checkpoint when --verbose is
> used and adds a paragraph about the checkpoint being run to the Notes
> section of the documentation.

Docs look good to me, other than claiming that pg_basebackup runs on a server (it can run anywhere). I would just say "during which pg_basebackup will appear idle". How does that sound to you?

As for the code, while I haven't tested it, isn't the "checkpoint completed" message in the wrong place? Doesn't PQsendQuery() complete immediately, so that the check needs to be put *after* the PQgetResult() call?

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
[HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint
Hi,

one take-away from the Gitlab post-mortem[1] appears to be that after their secondary lost replication, they were confused about what pg_basebackup was doing when they tried to rebuild it. It just sat there and did nothing (even with --verbose), so they assumed something was wrong with either the primary or the connection, and restarted it several times.

AFAICT, it turns out the checkpoint was being written on the master (they probably did not use -c fast), but this wasn't obvious to them:

"One of the engineers went to the secondary and wiped the data directory, then ran pg_basebackup. Unfortunately pg_basebackup would hang, producing no meaningful output, despite the --verbose option being set."

[...]

"Unfortunately this did not resolve the problem of pg_basebackup not starting replication immediately. One of the engineers decided to run it with strace to see what it was blocking on. strace showed that pg_basebackup was hanging in a poll call, but that did not provide any other meaningful information that might have explained why."

[...]

"It would later be revealed by another engineer (who wasn't around at the time) that this is normal behavior: pg_basebackup will wait for the primary to start sending over replication data and it will sit and wait silently until that time. Unfortunately this was not clearly documented in our engineering runbooks nor in the official pg_basebackup document."

ISTM that even with WAL streaming, nothing would be written on the client side until the checkpoint is complete, as do_pg_start_backup() runs the checkpoint and only returns the starting WAL location afterwards.

The attached (untested) patch is meant to kick off a discussion on how to improve the situation; it mentions the checkpoint when --verbose is used and adds a paragraph about the checkpoint being run to the Notes section of the documentation.


Michael

[1] https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index c9dd62c..a298e5c 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -660,6 +660,14 @@ PostgreSQL documentation
   Notes
 
+   At the beginning of the backup, a checkpoint needs to be written on the
+   server the backup is taken from. Especially if the option
+   --checkpoint=fast is not used, this can take some time
+   during which pg_basebackup will be idle on the
+   server it is running on.
+
+
    The backup will include all files in the data directory and tablespaces,
    including the configuration files and any additional files placed in the
    directory by third parties, except certain temporary files managed by
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index b6463fa..ae18c16 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1754,6 +1754,9 @@ BaseBackup(void)
 	if (maxrate > 0)
 		maxrate_clause = psprintf("MAX_RATE %u", maxrate);
 
+	if (verbose)
+		fprintf(stderr, "%s: initiating base backup, waiting for checkpoint to complete\n", progname);
+
 	basebkp =
 		psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s",
 				 escaped_label,
@@ -1771,6 +1774,9 @@ BaseBackup(void)
 		disconnect_and_exit(1);
 	}
 
+	if (verbose)
+		fprintf(stderr, "%s: checkpoint completed\n", progname);
+
 	/*
 	 * Get the starting xlog position
 	 */
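To make the operational takeaway concrete: the silent wait described in the post-mortem can be shortened by requesting an immediate checkpoint. A usage sketch -- host name, user, and target directory are placeholders:

    # --checkpoint=fast forces an immediate checkpoint instead of the default
    # spread checkpoint, so pg_basebackup starts transferring data sooner
    pg_basebackup -h primary.example.com -U replication_user \
        -D /var/lib/postgresql/standby --checkpoint=fast --verbose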