Re: [HACKERS] sepgsql: label regression test failed
On 05/14/2014 07:33 AM, Sergey Muraviov wrote: I've got this compiler warning: relation.c: In function ‘sepgsql_relation_drop’: relation.c:472:25: warning: ‘tclass’ may be used uninitialized in this function [-Wmaybe-uninitialized] sepgsql_avc_check_perms(object, ^ KaiGai, could you take a look at this warning, too? It looks like a genuine bug to me, but I'm not sure what we should do there instead. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Buffer manager scalability and correlated reference period
I have performed a new benchmark related to my ongoing experimentation around caching and buffer manager scalability. The benchmark tests a minor refinement of the prototype patch previously posted [1]. The patch itself is still very much a prototype, and does not significantly differ from what I originally posted. The big difference is usage_count starts at 6, and saturates at 30, plus I've tried to reduce the impact of the prior prototype's gettimeofday() calls by using clock_gettime() + CLOCK_MONOTONIC_COARSE. I previously posted some numbers for a patch with just the former change. I effectively disabled the background writer entirely here, since it never helps. These are unlogged tables, so as to not have the outcome obscured by checkpoint spikes during the sync phase that are more or less inevitable here (I believe this is particularly true given the hardware I'm using). Multiple client counts are tested, giving some indication of the impact on scalability. The same gains previously demonstrated in both transaction throughput and latency are once again clearly in evidence. I should emphasize that although I've talked a lot about LRU-K and other more sophisticated algorithms, this proof of concept still only adds a correlated reference period (while allowing usage_count to span a larger range). I have yet to come up with something really interesting, such as a patch that makes an inference about the frequency of access of a page based on the recency of its penultimate access (that is, a scheme that is similar to LRU-K, a scheme known to be used in other systems [2] and thought to be widely influential). The benchmark results are available from: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/collector-correlate This report was built using pgbench-collector, my fork of pgbench-tools. It is currently under very active development. 
The largest difference between it and pgbench-tools is that it offers more advanced reporting of operating system information, which can be correlated with pgbench latency and TPS figures. It's hosted on Github: https://github.com/petergeoghegan/pgbench-collector . Patches are very much welcome. This benchmark doesn't show very much new information. I thought that it might be useful to have detailed operating system statistics to work off of for each test. I believe that better benchmarking tools will help the planned improvements to buffer manager scalability. There are likely to be multiple angles of attack. [1] http://www.postgresql.org/message-id/cam3swzrm7-qmogmczix5u49ndfbls_wwtx6vjsja+bn_li5...@mail.gmail.com [2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf, Page 215 -- Peter Geoghegan
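The usage_count change Peter describes (start high at 6 and saturate at 30, rather than stock PostgreSQL's start-at-1, cap-at-5) can be illustrated with a toy clock-sweep model. All names and constants below are illustrative stand-ins, not the actual bufmgr.c code.

```python
# Toy model of clock-sweep buffer eviction, illustrating the prototype's
# change: usage_count starts at 6 and saturates at 30, so a newly loaded
# page must survive several sweeps before becoming an eviction victim.
# Purely a sketch; not PostgreSQL's real buffer manager.

INITIAL_USAGE = 6   # prototype: new pages start at 6 (stock PG: 1)
MAX_USAGE = 30      # prototype: saturate at 30 (stock PG: 5)

class ClockSweep:
    def __init__(self, nbuffers):
        self.usage = [0] * nbuffers     # 0 = eviction candidate
        self.page = [None] * nbuffers
        self.hand = 0

    def access(self, buf):
        """A buffer hit bumps usage_count, saturating at MAX_USAGE."""
        self.usage[buf] = min(self.usage[buf] + 1, MAX_USAGE)

    def evict(self):
        """Advance the clock hand, decrementing usage counts, until a
        buffer with usage_count 0 is found; that buffer is the victim."""
        while True:
            buf = self.hand
            self.hand = (self.hand + 1) % len(self.usage)
            if self.usage[buf] == 0:
                return buf
            self.usage[buf] -= 1

    def load(self, buf, pg):
        self.page[buf] = pg
        self.usage[buf] = INITIAL_USAGE

clock = ClockSweep(4)
for i in range(4):
    clock.load(i, f"page{i}")
clock.access(0)          # page0 is hotter than the rest
victim = clock.evict()   # the sweep skips page0 and picks a colder page
```

Because every page starts several sweeps away from eviction, a single extra access is enough to outlast pages that were touched only once — the correlated-reference-period effect the prototype is after.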
Re: [HACKERS] 9.0 PDF build broken
On 2014-05-15 22:27:42 -0400, Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: Mysteriously, commit 6b2a1445ec8a631060c4cbff3f172bf31d3379b9 has broken the PDF build (openjade + pdfjadetex) in the 9.0 branch only. The error is [256.0.28 ! pdfTeX error (ext4): \pdfendlink ended up in different nesting level than \pdfstartlink. I have reproduced this on two different platforms, and it affects only this branch. I guess this change might have caused the page boundaries to shift in an unfortunate way. I seem to recall we have had similar problems before. Does anyone remember? Yeah. This is caused by a hyperlink whose displayed text crosses a page boundary. The only known fix is to change the text enough so the link no longer runs across a page boundary. Unfortunately, pdfTeX is pretty unhelpful about identifying exactly where the problem is. I seem to recall having posted a recipe for finding such problems. Hm. Could a nonbreaking space be used inside such links to prevent the issue (i.e. &nbsp;)? That seems slightly more robust than trying to rephrase sentences to avoid the issue. It'd be even nicer if those could be added automagically in the TeX conversion. I have no idea how that works right now though. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Logical replication woes
On 05/15/2014 11:58 PM, Andres Freund wrote: On 2014-05-15 22:30:53 +0300, Heikki Linnakangas wrote: Attached patch fixes things, but I want to add some regression tests before commit. Looks good to me. Attached are two patches. One for the uninitialized dbId/tsId issue; one for the decoding bug. The former should be backpatched. Thanks, committed. Should you wonder about the slight reordering of the assignments in RecordTransactionCommitPrepared() - I've made it more similar to RecordTransactionCommit() to make it easier to see they are equivalent. Makes sense. - Heikki
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
I'm rebasing another implementation of this against current HEAD at the moment. It was well tested but has bitrotted a bit, in particular it needs merging with the multixact changes (eep). That should provide a useful basis for comparison and a chance to share ideas. I'll follow up with the patch and a git tree when it's ready, hopefully tonight. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Trigger concurrent execution
Hi, Thanks all for being patient, apparently I didn't quite understand the norms of trigger execution. On 16 May 2014 07:55, Craig Ringer cr...@2ndquadrant.com wrote: On 05/16/2014 08:06 AM, Blagoj Petrushev wrote: Hi, I'm thinking of an extension to trigger functionality like this: CREATE TRIGGER trigger_name AFTER event ON table CONCURRENTLY EXECUTE PROCEDURE trigger_fc This would call the trigger after the end of the transaction. If after the end of the transaction is what you mean by concurrently, then that's the wrong word to choose. AFTER COMMIT ? You're right, 'concurrently' is the wrong word. The concept of running a trigger concurrently just doesn't make sense in PostgreSQL, because the backend is single threaded. You wouldn't be able to run any SQL commands until the trigger finished. It isn't possible to do anything useful without a transaction, so PostgreSQL would need to start a transaction for the trigger and commit the transaction at the end, as if you'd run SELECT my_procedure();. Because it's outside the scope of the transaction it probably wouldn't be possible to do FOR EACH ROW with a NEW and OLD var, Right. Didn't think of this. unless you stashed them as materialized rows in the queue of pending AFTER COMMIT triggers. Finally, because it's after transaction commit, you couldn't easily guarantee that the trigger would really run. If the backend crashed / the server was shut down / etc after the commit but before your trigger finished, you'd have a committed transaction but the trigger would not run. To fix that you'd need to somehow make the trigger queue WAL-logged and run it during replay, which from my rather limited understanding of this area would be ... interesting to do. It'd also mean the trigger couldn't have any session context. This isn't easy, if it's practical at all. I have a big table with big text column article and a nullable tsvector column fts_article. 
On each insert or update that changes the article, I trigger-issue 'NOTIFY article_changed row_id', then, with a daemon listener, I catch the notification and update fts_article accordingly with my_fts_fc(article). The reason I don't do this directly in my trigger is because my_fts_fc is slow for big articles, fts_article has a gin index, and also, on heavy load, my listener can do these updates concurrently. Now, with a concurrent execution of triggers, I can just call my_fts_fc inside the trigger instead of the notify roundtrip. I don't think that really fits. It seems like you want to run the trigger procedure in the background on another back-end. That'd be quite cool, but also not trivial to do, especially if you wanted to guarantee that it happened reliably and in a crash-safe manner. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services I'll also try to reply to David G Johnston's answer here, since I didn't actually get the email. Conceptually, trigger actions run in-transaction and can cause it to ROLLBACK; so how would after the end of the transaction work? Since the easy way is to have COMMIT; block until all the AFTER event concurrent triggers fire I presume you would want something more like a task queue for background workers where, at commit, the function call is placed in a FIFO queue and the calling session is allowed to move onto other activity. As I see, for my problem, it would be great if there's a way to put a function call in a queue in a background worker. I don't know how to do this, however. But, the crash handling and NEW/OLD vars passing would remain a problem nonetheless. It is not clear what you mean by my listener can do these updates concurrently? Concurrently with each other or concurrently with other DML action on table? I assume you have multiple listeners since the potential rate of insert of the documents is likely much greater than the rate of update/indexing.
I meant concurrently with additional inserts to the table, as well as the fact that my listener is able to receive new notifications while updating some record. Also, it would seem you'd typically want the GIN index to be updated once the corresponding transaction commits and makes the rest of the data available. Or does your use case allow for some delay between the article being in the database physically and it being available in the index? The latter, namely, the search feature can 'lag' a few seconds after the content update. Thanks, Blagoj Petrushev
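The queue-plus-background-worker pattern discussed in this thread can be sketched outside the database. Below, an in-process thread and queue stand in for the NOTIFY/LISTEN daemon, and `my_fts_fc` is an invented stand-in for the slow text-search computation; this is purely illustrative, not a proposed implementation.

```python
# Sketch of the pattern: the trigger only enqueues the changed row id
# (cheap), and a background worker drains the queue and does the slow
# tsvector-style work. A thread-safe queue stands in for NOTIFY/LISTEN.
import queue
import threading

work_queue = queue.Queue()
fts_results = {}          # stands in for the fts_article column

def my_fts_fc(article):
    # stand-in for the expensive text-search computation
    return sorted(set(article.lower().split()))

def listener():
    """Background worker: drain (row_id, article) items, compute fts."""
    while True:
        row_id, article = work_queue.get()
        if row_id is None:            # shutdown sentinel
            break
        fts_results[row_id] = my_fts_fc(article)
        work_queue.task_done()

def on_insert(row_id, article):
    """What the AFTER trigger would do: a cheap enqueue, no slow work."""
    work_queue.put((row_id, article))

worker = threading.Thread(target=listener)
worker.start()
on_insert(1, "big article text here")
on_insert(2, "another big article")
work_queue.put((None, None))          # tell the worker to stop
worker.join()
```

Note the crash-safety caveat raised above applies here too: anything sitting in an in-memory queue at crash time is lost, which is exactly why a WAL-logged or table-backed queue would be needed for reliability.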
[HACKERS] btree_gist macaddr valgrind woes
Hi,

After Heikki has fixed the bit btree_gist bugs my valgrind run shows a bug in macaddr. Presumably it's because the type is a fixed width type with a length of 6. I guess it's allocating stuff sizeof(macaddr) but then passes the result around as a Datum which doesn't work well.

==14219== Invalid read of size 8
==14219==    at 0x4C2BB50: memcpy@@GLIBC_2.14 (mc_replace_strmem.c:882)
==14219==    by 0x45FA01: heap_fill_tuple (heaptuple.c:248)
==14219==    by 0x4629A7: index_form_tuple (indextuple.c:132)
==14219==    by 0x48ED05: gistFormTuple (gistutil.c:611)
==14219==    by 0x499801: gistBuildCallback (gistbuild.c:476)
==14219==    by 0x530412: IndexBuildHeapScan (index.c:2460)
==14219==    by 0x4991D4: gistbuild (gistbuild.c:209)
==14219==    by 0x8D813B: OidFunctionCall3Coll (fmgr.c:1650)
==14219==    by 0x52F816: index_build (index.c:1962)
==14219==    by 0x52E79D: index_create (index.c:1083)
==14219==    by 0x5E9253: DefineIndex (indexcmds.c:600)
==14219==    by 0x7A4DED: ProcessUtilitySlow (utility.c:1149)
==14219==  Address 0x67d37a0 is 8 bytes inside a block of size 12 client-defined
==14219==    at 0x8FB83A: palloc0 (mcxt.c:698)
==14219==    by 0x6B6C3C4: gbt_num_compress (btree_utils_num.c:31)
==14219==    by 0x6B74BF1: gbt_macad_compress (btree_macaddr.c:113)
==14219==    by 0x8D73E2: FunctionCall1Coll (fmgr.c:1298)
==14219==    by 0x48EB39: gistcentryinit (gistutil.c:576)
==14219==    by 0x48ECA1: gistFormTuple (gistutil.c:603)
==14219==    by 0x499801: gistBuildCallback (gistbuild.c:476)
==14219==    by 0x530412: IndexBuildHeapScan (index.c:2460)
==14219==    by 0x4991D4: gistbuild (gistbuild.c:209)
==14219==    by 0x8D813B: OidFunctionCall3Coll (fmgr.c:1650)
==14219==    by 0x52F816: index_build (index.c:1962)
==14219==    by 0x52E79D: index_create (index.c:1083)
==14219==
==14219== Invalid read of size 8
==14219==    at 0x4C2BA70: memcpy@@GLIBC_2.14 (mc_replace_strmem.c:882)
==14219==    by 0x45FA01: heap_fill_tuple (heaptuple.c:248)
==14219==    by 0x4629A7: index_form_tuple (indextuple.c:132)
==14219==    by 0x48ED05: gistFormTuple (gistutil.c:611)
==14219==    by 0x48C755: gistSplit (gist.c:1314)
==14219==    by 0x488ED4: gistplacetopage (gist.c:242)
==14219==    by 0x48C025: gistinserttuples (gist.c:1134)
==14219==    by 0x48BFAA: gistinserttuple (gist.c:1093)
==14219==    by 0x48AC56: gistdoinsert (gist.c:750)
==14219==    by 0x49985B: gistBuildCallback (gistbuild.c:490)
==14219==    by 0x530412: IndexBuildHeapScan (index.c:2460)
==14219==    by 0x4991D4: gistbuild (gistbuild.c:209)
==14219==  Address 0x67dba58 is 8 bytes inside a block of size 12 client-defined
==14219==    at 0x8FB6D2: palloc (mcxt.c:678)
==14219==    by 0x6B74D04: gbt_macad_union (btree_macaddr.c:145)
==14219==    by 0x8D74D0: FunctionCall2Coll (fmgr.c:1324)
==14219==    by 0x48D7C8: gistMakeUnionItVec (gistutil.c:198)
==14219==    by 0x496E8E: gistunionsubkeyvec (gistsplit.c:64)
==14219==    by 0x496F05: gistunionsubkey (gistsplit.c:91)
==14219==    by 0x498981: gistSplitByKey (gistsplit.c:689)
==14219==    by 0x48C423: gistSplit (gist.c:1270)
==14219==    by 0x488ED4: gistplacetopage (gist.c:242)
==14219==    by 0x48C025: gistinserttuples (gist.c:1134)
==14219==    by 0x48BFAA: gistinserttuple (gist.c:1093)
==14219==    by 0x48AC56: gistdoinsert (gist.c:750)
==14219==
==14250== Uninitialised byte(s) found during client check request
==14250==    at 0x794E9F: PageAddItem (bufpage.c:314)
==14250==    by 0x48CC95: gistfillbuffer (gistutil.c:42)
==14250==    by 0x489CBC: gistplacetopage (gist.c:451)
==14250==    by 0x48C025: gistinserttuples (gist.c:1134)
==14250==    by 0x48C31C: gistfinishsplit (gist.c:1240)
==14250==    by 0x48C0F5: gistinserttuples (gist.c:1159)
==14250==    by 0x48BFAA: gistinserttuple (gist.c:1093)
==14250==    by 0x48AC56: gistdoinsert (gist.c:750)
==14250==    by 0x49985B: gistBuildCallback (gistbuild.c:490)
==14250==    by 0x530412: IndexBuildHeapScan (index.c:2460)
==14250==    by 0x4991D4: gistbuild (gistbuild.c:209)
==14250==    by 0x8D813B: OidFunctionCall3Coll (fmgr.c:1650)
==14250==  Address 0x6efbcfb is 19 bytes inside a block of size 40 client-defined
==14250==    at 0x8FB83A: palloc0 (mcxt.c:698)
==14250==    by 0x462954: index_form_tuple (indextuple.c:129)
==14250==    by 0x48ED05: gistFormTuple (gistutil.c:611)
==14250==    by 0x48C618: gistSplit (gist.c:1292)
==14250==    by 0x488ED4: gistplacetopage (gist.c:242)
==14250==    by 0x48C025: gistinserttuples (gist.c:1134)
==14250==    by 0x48BFAA: gistinserttuple (gist.c:1093)
==14250==    by 0x48AC56: gistdoinsert (gist.c:750)
==14250==    by 0x49985B: gistBuildCallback (gistbuild.c:490)
==14250==    by 0x530412: IndexBuildHeapScan (index.c:2460)
==14250==    by 0x4991D4: gistbuild (gistbuild.c:209)
==14250==    by 0x8D813B: OidFunctionCall3Coll (fmgr.c:1650)
==14250==  Uninitialised value was created by a heap allocation
==14250==    at 0x8FB6D2: palloc (mcxt.c:678)
==14250==    by 0x6B6CFCE: gbt_var_key_copy
Re: [HACKERS] Priority table or Cache table
Hello, I applied the patch to current HEAD. There was one failure (attached): freelist.rej http://postgresql.1045698.n5.nabble.com/file/n5804200/freelist.rej

Compiled the provided pgbench.c and added the following in .conf:

shared_buffers = 128MB # min 128kB
Shared_buffers=64MB
Priority_buffers=128MB

I was planning to performance test later, hence the different values. But while executing pgbench the following assertion occurs:

LOG: database system is ready to accept connections
LOG: autovacuum launcher started
TRAP: FailedAssertion(!(strategy_delta >= 0), File: bufmgr.c, Line: 1435)
LOG: background writer process (PID 10274) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

Is there a way to avoid it? Am I making some mistake?

regards
Sameer
Re: [HACKERS] sepgsql: label regression test failed
2014-05-16 16:26 GMT+09:00 Heikki Linnakangas hlinnakan...@vmware.com: On 05/14/2014 07:33 AM, Sergey Muraviov wrote: I've got this compiler warning: relation.c: In function ‘sepgsql_relation_drop’: relation.c:472:25: warning: ‘tclass’ may be used uninitialized in this function [-Wmaybe-uninitialized] sepgsql_avc_check_perms(object, ^ KaiGai, could you take a look at this warning, too? It looks like a genuine bug to me, but I'm not sure what we should do there instead. This warning is harmless, because the code path that does not initialize the tclass variable (the case when the dropped relation is an index) never reaches the code path that references tclass. It just checks the schema's {remove_name} permission, then jumps to another code path for indexes and never comes back. BTW, I could not reproduce this message in my environment with -Wall (Fedora 20, gcc-4.8.2). Is it a newer compiler's wisdom? Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp

sepgsql-fixup-maybe-uninitialized-warnning.patch
Description: Binary data
Re: [HACKERS] Priority table or Cache table
On 20 Feb 2014, at 01:38, Tom Lane t...@sss.pgh.pa.us wrote: Haribabu Kommi kommi.harib...@gmail.com writes: I want to propose a new feature called priority table or cache table. This is the same as a regular table except that its pages have higher priority than those of normal tables. These tables are very useful where faster query processing on some particular tables is expected. Why exactly does the existing LRU behavior of shared buffers not do what you need? I am really dubious that letting DBAs manage buffers is going to be an improvement over automatic management. regards, tom lane

The reason for a feature like that is to define an area of the application which needs more predictable runtime behaviour. Not all tables are created equal in terms of importance. Example: user authentication should always be supersonic fast while some reporting tables might gladly be forgotten even if they happened to be in use recently. I am not saying that we should have this feature. However, there are definitely use cases which would justify some more control here. Otherwise people will fall back and use dirty tricks such as “SELECT count(*)” or so to emulate what we got here.

many thanks,

hans

-- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
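The "keep the auth table hot" use case can be illustrated with a toy cache that gives pages of designated tables a protected priority class. This is only a sketch of the concept, invented for illustration; it is not the design of the proposed patch.

```python
# Toy illustration of the "priority table" idea: an LRU cache in which
# pages of designated high-priority tables are never chosen as eviction
# victims while any normal page remains. Illustrative only.
from collections import OrderedDict

class PriorityLRU:
    def __init__(self, capacity, priority_tables):
        self.capacity = capacity
        self.priority_tables = set(priority_tables)
        self.cache = OrderedDict()       # key -> table name, in LRU order

    def access(self, key, table):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
        else:
            if len(self.cache) >= self.capacity:
                self._evict()
            self.cache[key] = table

    def _evict(self):
        # Prefer the least-recently-used page of a *normal* table.
        for key, table in list(self.cache.items()):   # LRU order
            if table not in self.priority_tables:
                del self.cache[key]
                return
        # Only priority pages left: fall back to plain LRU.
        self.cache.popitem(last=False)

cache = PriorityLRU(capacity=3, priority_tables={"users"})
cache.access("users:1", "users")
cache.access("report:1", "reports")
cache.access("report:2", "reports")
cache.access("report:3", "reports")   # evicts report:1, not users:1
```

The trade-off Tom raises is visible even in the toy: when priority pages fill the cache, normal tables thrash no matter how hot they are, which is exactly why handing this knob to DBAs is contentious.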
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
On 05/16/2014 04:46 PM, Craig Ringer wrote: I'll follow up with the patch and a git tree when it's ready, hopefully tonight. Here's a rebased version of Simon's original patch that runs on current master. I still need to get the isolation tests for it merged and sorted out, and after re-reading it I'd like to change waitMode into an enum, not just some #defines. Hope it's useful for comparison and ideas. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

From c9532344cdabcde4d1992659ec8be8ad4b0041ce Mon Sep 17 00:00:00 2001
From: Craig Ringer cr...@2ndquadrant.com
Date: Wed, 31 Jul 2013 18:37:46 +0800
Subject: [PATCH] implement FOR UPDATE SKIP LOCKED
---
 src/backend/access/heap/heapam.c | 54 ---
 src/backend/executor/execMain.c | 2 +-
 src/backend/executor/nodeLockRows.c | 20 +++-
 src/backend/nodes/copyfuncs.c | 6 ++--
 src/backend/nodes/equalfuncs.c | 4 +--
 src/backend/nodes/outfuncs.c | 6 ++--
 src/backend/nodes/readfuncs.c | 2 +-
 src/backend/optimizer/plan/planner.c | 4 +--
 src/backend/optimizer/prep/prepsecurity.c | 8 ++---
 src/backend/optimizer/prep/prepunion.c | 2 +-
 src/backend/parser/analyze.c | 17 +-
 src/backend/parser/gram.y | 23 -
 src/backend/rewrite/rewriteHandler.c | 18 +--
 src/backend/utils/adt/ruleutils.c | 4 ++-
 src/include/access/heapam.h | 2 +-
 src/include/nodes/execnodes.h | 5 ++-
 src/include/nodes/parsenodes.h | 4 +--
 src/include/nodes/plannodes.h | 2 +-
 src/include/parser/analyze.h | 2 +-
 src/include/parser/kwlist.h | 2 ++
 20 files changed, 118 insertions(+), 69 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b77c32c..8e7f2bf 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -56,6 +56,7 @@
 #include "catalog/namespace.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
@@ -4081,7 +4082,7 @@
get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  * cid: current command ID (used for visibility test, and stored into
  *	tuple's cmax if lock is successful)
  * mode: indicates if shared or exclusive tuple lock is desired
- * nowait: if true, ereport rather than blocking if lock not available
+ * waitMode: mode describes handling of lock waits
  * follow_updates: if true, follow the update chain to also lock descendant
  *	tuples.
  *
@@ -4105,7 +4106,7 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  */
 HTSU_Result
 heap_lock_tuple(Relation relation, HeapTuple tuple,
-				CommandId cid, LockTupleMode mode, bool nowait,
+				CommandId cid, LockTupleMode mode, int waitMode,
 				bool follow_updates,
 				Buffer *buffer, HeapUpdateFailureData *hufd)
 {
@@ -4208,16 +4209,21 @@ l3:
 		 */
 		if (!have_tuple_lock)
 		{
-			if (nowait)
+			if (waitMode == WAITMODE_NORMAL)
+				LockTupleTuplock(relation, tid, mode);
+			else
 			{
 				if (!ConditionalLockTupleTuplock(relation, tid, mode))
-					ereport(ERROR,
+				{
+					if (waitMode == WAITMODE_SKIP_LOCKED)
+						return result;
+					else if (waitMode == WAITMODE_NOWAIT)
+						ereport(ERROR,
 							(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
-							 errmsg("could not obtain lock on row in relation \"%s\"",
-									RelationGetRelationName(relation))));
+							 errmsg("could not obtain lock on row in relation \"%s\"",
+									RelationGetRelationName(relation))));
+				}
 			}
-			else
-				LockTupleTuplock(relation, tid, mode);
 			have_tuple_lock = true;
 		}
@@ -4419,21 +4425,26 @@ l3:
 					elog(ERROR, "invalid lock mode in heap_lock_tuple");
 				/* wait for multixact to end */
-				if (nowait)
+				if (waitMode == WAITMODE_NORMAL)
+					MultiXactIdWait((MultiXactId) xwait, status, infomask,
+									relation, &tuple->t_data->t_ctid,
+									XLTW_Lock, NULL);
+				else
 				{
 					if (!ConditionalMultiXactIdWait((MultiXactId) xwait,
 													status, infomask, relation,
 													&tuple->t_data->t_ctid,
 													XLTW_Lock, NULL))
-						ereport(ERROR,
+					{
+						if (waitMode == WAITMODE_SKIP_LOCKED)
+							return result;
+						else if (waitMode == WAITMODE_NOWAIT)
+							ereport(ERROR,
 								(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
 								 errmsg("could not obtain lock on row in relation \"%s\"",
										RelationGetRelationName(relation))));
+					}
 				}
-				else
-					MultiXactIdWait((MultiXactId) xwait, status, infomask,
-									relation, &tuple->t_data->t_ctid,
-									XLTW_Lock, NULL);
 				/* if there are updates, follow the update chain */
 				if (follow_updates
@@ -4479,17 +4490,22 @@ l3:
 			else
 			{
 				/* wait for regular
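The three wait modes the patch above distinguishes (normal blocking, NOWAIT erroring, SKIP LOCKED passing over the row) can be sketched outside C. The lock table here is just a set of row ids held by "other transactions"; names mirror the patch's #defines but everything else is illustrative.

```python
# Sketch of the patch's wait modes: for each candidate row, NORMAL would
# block until the lock is free, NOWAIT raises an error, and SKIP LOCKED
# simply moves on to the next row. Illustrative only.
WAITMODE_NORMAL, WAITMODE_NOWAIT, WAITMODE_SKIP_LOCKED = range(3)

class LockNotAvailable(Exception):
    pass

def fetch_rows(rows, locked_by_others, wait_mode):
    """Return the row ids this transaction manages to lock."""
    acquired = []
    for row in rows:
        if row in locked_by_others:
            if wait_mode == WAITMODE_NOWAIT:
                raise LockNotAvailable(f"could not obtain lock on row {row}")
            elif wait_mode == WAITMODE_SKIP_LOCKED:
                continue                  # skip this row, try the next
            else:
                # WAITMODE_NORMAL would block here until the holder
                # commits; blocking is not simulated in this sketch.
                raise NotImplementedError("would block")
        acquired.append(row)
    return acquired

locked = {2, 3}
got = fetch_rows([1, 2, 3, 4], locked, WAITMODE_SKIP_LOCKED)   # [1, 4]
```

This is the behavior that makes SKIP LOCKED attractive for job-queue workloads: concurrent workers each grab disjoint unlocked rows instead of serializing on the first locked one.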
Re: [HACKERS] btree_gist macaddr valgrind woes
On 05/16/2014 01:28 PM, Andres Freund wrote: Hi, After Heikki has fixed the bit btree_gist bugs my valgrind run shows a bug in macaddr. Presumably it's because the type is a fixed width type with a length of 6. I guess it's allocating stuff sizeof(macaddr) but then passes the result around as a Datum which doesn't work well. The complaints seem to be coming from all the varlen data types, too, not just macaddr: ... ==14219==by 0x6B74BF1: gbt_macad_compress (btree_macaddr.c:113) ... ==14250==by 0x6B75C45: gbt_text_picksplit (btree_text.c:215) ... ==14280==by 0x6B76063: gbt_bytea_picksplit (btree_bytea.c:143) ... ==14292==by 0x6B76591: gbt_bit_union (btree_bit.c:173) ... I'm not seeing those valgrind warnings when I run it. Can you give more details on how to reproduce? - Heikki
Re: [HACKERS] btree_gist macaddr valgrind woes
On 2014-05-16 17:08:40 +0300, Heikki Linnakangas wrote: On 05/16/2014 01:28 PM, Andres Freund wrote: Hi, After Heikki has fixed the bit btree_gist bugs my valgrind run shows a bug in macaddr. Presumably it's because the type is a fixed width type with a length of 6. I guess it's allocating stuff sizeof(macaddr) but then passes the result around as a Datum which doesn't work well. The complaints seem to be coming from all the varlen data types, too, not just macaddr: Hm, right. Didn't look very far yet. ... ==14219==by 0x6B74BF1: gbt_macad_compress (btree_macaddr.c:113) ... ==14250==by 0x6B75C45: gbt_text_picksplit (btree_text.c:215) ... ==14280==by 0x6B76063: gbt_bytea_picksplit (btree_bytea.c:143) ... ==14292==by 0x6B76591: gbt_bit_union (btree_bit.c:173) ... I'm not seeing those valgrind warnings when I run it. Can you give more details on how to reproduce? I've added a bit more support for valgrind, but I don't see that being relevant in those. Are you compiling with USE_VALGRIND? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
Hello, Can we get that fixed please? It seems rather bad behavior for pg_basebackup to fatal out because of the permissions on a backup file of all things. Instead, we should do WARNING and say skipped. JD
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
Hi, On 2014-05-16 07:28:42 -0700, Joshua D. Drake wrote: Can we get that fixed please? It seems rather bad behavior for pg_basebackup to fatal out because of the permissions on a backup file of all things. Instead, we should do WARNING and say skipped. Doesn't sound like a good idea to me. We'd need to have a catalog of common unimportant file endings and such. We surely *do* want to error out when we fail to copy an important file. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
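The policy being debated — warn-and-skip for known-unimportant files, hard failure for everything else — can be sketched as follows. The suffix list is invented for illustration; no such catalog exists in pg_basebackup, which is precisely Andres's objection.

```python
# Sketch of the skip-vs-error policy under discussion: on a permission
# error, skip (with a warning) only files matching a list of known
# unimportant endings, and fail hard for everything else, since silently
# omitting a real file would corrupt the backup. The suffix list is a
# hypothetical stand-in, not anything pg_basebackup actually has.
UNIMPORTANT_SUFFIXES = ("~", ".tmp", ".swp", ".bak")

def handle_unreadable(path):
    """Decide what to do when a file in the data directory can't be read."""
    if path.endswith(UNIMPORTANT_SUFFIXES):
        return ("warning", f"skipping unreadable file: {path}")
    # Failing to copy anything else could silently break the backup.
    return ("fatal", f"could not open file {path}: Permission denied")

level, _ = handle_unreadable("./pg_hba.conf~")    # editor backup: skip
level2, _ = handle_unreadable("./pg_hba.conf")    # real config: fatal
```

The hard part is exactly what the sketch hand-waves: deciding which endings are safely "unimportant" for every editor and tool that might litter a data directory.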
Re: [HACKERS] Scaling shared buffer eviction
On Thu, May 15, 2014 at 11:11 AM, Amit Kapila amit.kapil...@gmail.com wrote:

Data with LWLOCK_STATS
--
BufMappingLocks

PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22

This data shows that after the patch there is no contention for BufFreeListLock; rather, there is huge contention around the BufMappingLocks. I have checked that HEAD also has contention around BufMappingLocks. As per my analysis till now, I think reducing contention around BufFreelistLock is not sufficient to improve scalability; we need to work on reducing contention around BufMappingLocks as well.
To reduce the contention around BufMappingLocks, I have tried the patch by just increasing the number of buffer partitions, and it actually shows a really significant increase in scalability, both due to reduced contention around BufFreeListLock and around BufMappingLocks. The real effect of reducing contention around BufFreeListLock was hidden because the whole contention was shifted to BufMappingLocks. I have taken performance data for both HEAD+increase_buf_part and Patch+increase_buf_part to clearly see the benefit of reducing contention around BufFreeListLock. This data has been taken using pgbench read only load (Select).

Performance Data
---
HEAD + 64 = HEAD + (NUM_BUFFER_PARTITIONS(64) + LOG2_NUM_LOCK_PARTITIONS(6))
V1 + 64 = PATCH + (NUM_BUFFER_PARTITIONS(64) + LOG2_NUM_LOCK_PARTITIONS(6))
Similarly 128 means 128 buffer partitions

shared_buffers = 8GB
scale factor = 3000
RAM - 64GB

             Thrds (64)   Thrds (128)
HEAD         45562        17128
HEAD + 64    57904        32810
V1 + 64      105557       81011
HEAD + 128   58383        32997
V1 + 128     110705       114544

shared_buffers = 8GB
scale factor = 1000
RAM - 64GB

             Thrds (64)   Thrds (128)
HEAD         92142        31050
HEAD + 64    108120       86367
V1 + 64      117454       123429
HEAD + 128   107762       86902
V1 + 128     123641       124822

Observations
-
1. There is an increase of up to 5 times in performance for data that can fit in memory but not in shared buffers.
2. Though there is an increase in performance just by increasing the number of buffer partitions, it doesn't scale well (especially see the case when partitions have increased to 128 from 64).

I have verified that contention has reduced around BufMappingLocks by running the patch with LWLOCK_STATS:

BufFreeListLock
PID 17894 lwlock main 0: shacq 0 exacq 171 blk 27 spindelay 1

BufMappingLocks
PID 17902 lwlock main 38: shacq 12770 exacq 10104 blk 282 spindelay 0
PID 17924 lwlock main 39: shacq 11409 exacq 10257 blk 243 spindelay 0
PID 17929 lwlock main 40: shacq 13120 exacq 10739 blk 239 spindelay 0
PID 17940 lwlock main 41: shacq 11865 exacq 10373 blk 262 spindelay 0
..
PID 17831 lwlock main 162: shacq 12706 exacq 10267 blk 199 spindelay 0
PID 17826 lwlock main 163: shacq 11081 exacq 10256 blk 168 spindelay 0
PID 17903 lwlock main 164: shacq 11494 exacq 10375 blk 176 spindelay 0
PID 17899 lwlock main 165: shacq 12043 exacq 10485 blk 216 spindelay 0

We can clearly notice that the number for *blk* has reduced significantly, which shows that contention has reduced. The patch is still only in shape to prove the merit of the idea; I have just changed the number of partitions so that anyone who wants to verify the performance for a similar load can do so by simply applying the patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

[Attachment: scalable_buffer_eviction_v2.patch]
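To illustrate why raising the partition count helps: the buffer mapping hash table is split into 2^LOG2_NUM_LOCK_PARTITIONS partitions, each guarded by its own LWLock, and a lookup only takes the lock for the partition its hash value falls into. The sketch below is a hypothetical simplification (not the actual PostgreSQL code); the function name `buf_mapping_partition` is invented for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of buffer-mapping lock partitioning: with more
 * partitions, two backends touching different buffers are less likely
 * to collide on the same partition lock.
 */
#define LOG2_NUM_LOCK_PARTITIONS 6
#define NUM_BUFFER_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)

/* Pick the partition (and hence the LWLock) for a buffer-tag hash value. */
static inline uint32_t
buf_mapping_partition(uint32_t hashcode)
{
    return hashcode % NUM_BUFFER_PARTITIONS;
}
```

With 64 partitions, hash values that differ in their low six bits land on different locks, which is exactly the contention spreading the numbers above measure.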
[HACKERS] chr() is still too loose about UTF8 code points
Quite some time ago, we made the chr() function accept Unicode code points up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8 string. It was pointed out to me though that RFC3629 restricted the original definition of UTF8 to only allow code points up to U+10FFFF (for compatibility with UTF16). While that might not be something we feel we need to follow exactly, pg_utf8_islegal implements the checking algorithm specified by RFC3629, and will therefore reject points above U+10FFFF. This means you can use chr() to create values that will be rejected on dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr('x001fffff'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
 f1
(1 row)

u8=# \copy tt to 'junk'
COPY 1
u8=# \copy tt from 'junk'
ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xf7 0xbf 0xbf 0xbf
CONTEXT:  COPY tt, line 1
LOCATION:  report_invalid_encoding, wchar.c:2011

I think this probably means we need to change chr() to reject code points above U+10FFFF. Should we back-patch that, or just do it in HEAD?

regards, tom lane
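A minimal sketch of the range check being discussed for chr() (not the actual PostgreSQL implementation): RFC 3629 caps UTF-8 at U+10FFFF and also excludes the UTF-16 surrogate range, so a conforming chr() would reject both.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch only: validity of a Unicode code point per RFC 3629.
 * U+110000 and above do not exist in Unicode; U+D800..U+DFFF are
 * surrogates, which are illegal as standalone UTF-8 sequences.
 */
static bool
is_valid_unicode_codepoint(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;           /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;           /* surrogate code point */
    return true;
}
```

Under this check, the chr('x001fffff'...) example above (code point U+1FFFFF) would be rejected at insert time instead of failing at reload.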
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
On 05/16/2014 07:30 AM, Andres Freund wrote:
> On 2014-05-16 07:28:42 -0700, Joshua D. Drake wrote:
>> Can we get that fixed please? It seems rather bad behavior for pg_basebackup to fatal out because of the permissions on a backup file of all things. Instead, we should do WARNING and say skipped.
> Doesn't sound like a good idea to me. We'd need to have a catalog of common unimportant file endings and such. We surely *do* want to error out when we fail to copy an important file.

pg_hba.conf~ is not an important file. We know what files are important, especially in $PGDATA; they aren't variable. So why is pg_basebackup failing on a file it should know or care nothing about?

JD
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
Joshua D. Drake j...@commandprompt.com writes:
> On 05/16/2014 07:30 AM, Andres Freund wrote:
>> On 2014-05-16 07:28:42 -0700, Joshua D. Drake wrote:
>>> Can we get that fixed please? It seems rather bad behavior for pg_basebackup to fatal out because of the permissions on a backup file of all things. Instead, we should do WARNING and say skipped.
>> Doesn't sound like a good idea to me. We'd need to have a catalog of common unimportant file endings and such. We surely *do* want to error out when we fail to copy an important file.
> pg_hba.conf~ is not an important file.

Rather than blaming the messenger, you should be asking why there are files in $PGDATA that the server can't read. That's a recipe for trouble no matter what. Or in words of one syllable: this is a bug in your editor, not in Postgres.

regards, tom lane
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
On 2014-05-16 08:13:04 -0700, Joshua D. Drake wrote:
> On 05/16/2014 07:30 AM, Andres Freund wrote:
>> On 2014-05-16 07:28:42 -0700, Joshua D. Drake wrote:
>>> Can we get that fixed please? It seems rather bad behavior for pg_basebackup to fatal out because of the permissions on a backup file of all things. Instead, we should do WARNING and say skipped.
>> Doesn't sound like a good idea to me. We'd need to have a catalog of common unimportant file endings and such. We surely *do* want to error out when we fail to copy an important file.
> pg_hba.conf~ is not an important file.

Where do we know that from?

> We know what files are important, especially in $PGDATA, they aren't variable, so why is pg_basebackup failing on a file it should know or care nothing about?

No, we don't necessarily. It'd e.g. be bad to succeed if postgresql.conf includes another file and we fail when backing that up, even though it's in the data directory. But even otherwise it'd be a non-negligible amount of code to enumerate possibly important files (which wouldn't be fully reliable; we can't access the catalogs). Code that's only there to work around a user doing something bad that's trivially fixable. Nah.

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] btree_gist macaddr valgrind woes
Heikki Linnakangas hlinnakan...@vmware.com writes:
> On 05/16/2014 01:28 PM, Andres Freund wrote:
>> Presumably it's because the type is a fixed width type with a length of 6. I guess it's allocating stuff sizeof(macaddr) but then passes the result around as a Datum which doesn't work well.

I've not tried valgrinding to check, but for macaddr I suspect the blame can be laid at the feet of gbt_num_compress, which is creating a datum of actual size 2 * tinfo->size (i.e., 12 bytes for macaddr). This is later stored into an index column of declared type gbtreekey16, so the tuple assembly code is expecting the datum to have size 16. AFAICS there is no direct connection between the sizes the C code knows about and the sizes of the datatypes declared as STORAGE options in btree_gist--1.0.sql. Probably a reasonable fix would be to add a field to gbtree_ninfo that specifies the index datum size, and then make gbt_num_compress palloc0 that size rather than 2 * tinfo->size.

> The complaints seem to be coming from all the varlen data types, too, not just macaddr:

Dunno what's the problem for the varlena types, but that's a completely separate code path.

regards, tom lane
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
On 05/16/2014 08:19 AM, Tom Lane wrote:
>> pg_hba.conf~ is not an important file.
> Rather than blaming the messenger, you should be asking why there are files in $PGDATA that the server can't read. That's a recipe for trouble no matter what. Or in words of one syllable: this is a bug in your editor, not in Postgres.

Hardly, and that shows a distinct lack of user-space experience. It also shows how useless pg_basebackup can be. Basically you are saying: well yeah, there is this rogue file that doesn't belong, fark it... we will blow away a 2TB base backup and make you start over because... meh, pg_basebackup is lazy.

Software is supposed to make our lives easier, not harder. I should be able to evaluate the errors for the conditions they create. This is why rsync is, and for the foreseeable future will be, king for creating base backups.

JD
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
At a minimum: check to see if there is going to be a permission error BEFORE the base backup begins:

starting basebackup:
  checking perms: ERROR no access to pg_hba.conf~, base backup will fail

JD
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
Hi,

On 2014-05-16 08:45:12 -0700, Joshua D. Drake wrote:
> Software is supposed to make our lives easier, not harder. I should be able to evaluate the errors for the conditions they create. This is why rsync is and for the foreseeable future will be king for creating base backups.

It's dangerous to ignore rsync errors other than 'file vanished'. This hardly is an argument for your position.

Greetings,

Andres Freund
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
On 05/16/2014 08:48 AM, Andres Freund wrote:
> It's dangerous to ignore rsync errors other than 'file vanished'. This hardly is an argument for your position.

Are you reading what I write? I said: I should be able to evaluate the errors for the conditions they create. I never suggested ignoring anything. The point is rsync gives me a chance at success; pg_basebackup does not (in respect to this specific condition).

JD
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
On Fri, May 16, 2014 at 5:46 PM, Joshua D. Drake j...@commandprompt.com wrote:
> At a minimum: Check to see if there is going to be a permission error BEFORE the base backup begins:
> starting basebackup: checking perms: ERROR no access to pg_hba.conf~ base backup will fail

That's pretty much what it does if you enable the progress meter. I realize you don't necessarily want that one, but we could have a switch that still tells the server to measure the size, but not actually print the output? While it costs a bit of overhead to do that, it's certainly a lot safer than ignoring errors.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
On 2014-05-16 18:20:35 +0200, Magnus Hagander wrote:
> On Fri, May 16, 2014 at 5:46 PM, Joshua D. Drake j...@commandprompt.com wrote:
>> At a minimum: Check to see if there is going to be a permission error BEFORE the base backup begins: starting basebackup: checking perms: ERROR no access to pg_hba.conf~ base backup will fail
> That's pretty much what it does if you enable progress meter. I realize you don't necessarily want that one, but we could have a switch that still tells the server to measure the size, but not actually print the output? While it costs a bit of overhead to do that, that's certainly something that's a lot more safe than ignoring errors.

Don't think it'll show you that error - that mode only stat()s files, right? So you'd need to add access() or open()s.

Greetings,

Andres Freund
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
On Fri, May 16, 2014 at 6:25 PM, Andres Freund and...@2ndquadrant.com wrote:
> On 2014-05-16 18:20:35 +0200, Magnus Hagander wrote:
>> That's pretty much what it does if you enable progress meter. I realize you don't necessarily want that one, but we could have a switch that still tells the server to measure the size, but not actually print the output? While it costs a bit of overhead to do that, that's certainly something that's a lot more safe than ignoring errors.
> Don't think it'll show you that error - that mode only stat()s files, right? So you'd need to add access() or open()s.

You're right, we don't. I thought we did, but was clearly remembering wrong. I guess we could add an access() call to that codepath though. Not sure if that's going to cause any real overhead compared to the rest of what we're doing anyway?
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
On 2014-05-16 18:29:25 +0200, Magnus Hagander wrote:
> On Fri, May 16, 2014 at 6:25 PM, Andres Freund and...@2ndquadrant.com wrote:
>> Don't think it'll show you that error - that mode only stat()s files, right? So you'd need to add access() or open()s.
> You're right, we don't. I thought we did, but was clearly remembering wrong. I guess we could add an access() call to that codepath though. Not sure if that's going to cause any real overhead compared to the rest of what we're doing anyway?

It's not free. But I don't think it'd seriously matter in comparison. But it doesn't protect you if the file is created during the backup - which, as you know, can take a long time. For example because somebody felt the need to increase wal_keep_segments.

Greetings,

Andres Freund
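The access() pre-check being discussed could look roughly like this. This is a hypothetical sketch, not pg_basebackup code; the helper name `file_is_readable` is invented. The idea is that the pass which walks the data directory to measure its size would also verify readability, so a permission problem is reported before any data is streamed (with the caveat, noted above, that files created after the walk are not caught).

```c
#include <stdbool.h>
#include <unistd.h>   /* access(), R_OK */

/*
 * Sketch: return true if the current user can read the file at `path`.
 * A size-measuring directory walk could call this per file and error out
 * early, instead of failing hours into a multi-terabyte backup.
 */
static bool
file_is_readable(const char *path)
{
    return access(path, R_OK) == 0;
}
```

access() is a cheap metadata check, consistent with the thread's point that the extra cost would not seriously matter next to the backup itself.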
Re: [HACKERS] chr() is still too loose about UTF8 code points
On 05/16/2014 06:05 PM, Tom Lane wrote:
> Quite some time ago, we made the chr() function accept Unicode code points up to U+1FFFFF, which is the largest value that will fit in a 4-byte UTF8 string. It was pointed out to me though that RFC3629 restricted the original definition of UTF8 to only allow code points up to U+10FFFF (for compatibility with UTF16). While that might not be something we feel we need to follow exactly, pg_utf8_islegal implements the checking algorithm specified by RFC3629, and will therefore reject points above U+10FFFF. This means you can use chr() to create values that will be rejected on dump and reload.
> I think this probably means we need to change chr() to reject code points above U+10FFFF. Should we back-patch that, or just do it in HEAD?

+1 for back-patching. A value that cannot be restored is bad, and I can't imagine any legitimate use case for producing a Unicode character larger than U+10FFFF with chr(x), when the rest of the system doesn't handle it. Fully supporting such values might be useful, but that's a different story.

- Heikki
Re: [HACKERS] pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
On 05/16/2014 09:20 AM, Magnus Hagander wrote:
> On Fri, May 16, 2014 at 5:46 PM, Joshua D. Drake j...@commandprompt.com wrote:
>> At a minimum: Check to see if there is going to be a permission error BEFORE the base backup begins: starting basebackup: checking perms: ERROR no access to pg_hba.conf~ base backup will fail
> That's pretty much what it does if you enable progress meter. I realize you don't necessarily want that one, but we could have a switch that still tells the server to measure the size, but not actually print the output? While it costs a bit of overhead to do that, that's certainly something that's a lot more safe than ignoring errors.

That seems reasonable.

JD
Re: [HACKERS] chr() is still too loose about UTF8 code points
On 05/16/2014 12:43 PM, Heikki Linnakangas wrote:
> On 05/16/2014 06:05 PM, Tom Lane wrote:
>> I think this probably means we need to change chr() to reject code points above U+10FFFF. Should we back-patch that, or just do it in HEAD?
> +1 for back-patching. A value that cannot be restored is bad, and I can't imagine any legitimate use case for producing a Unicode character larger than U+10FFFF with chr(x), when the rest of the system doesn't handle it. Fully supporting such values might be useful, but that's a different story.

My understanding is that U+10FFFF is the highest legal Unicode code point anyway. So this is really just tightening our routines to make sure we don't produce an invalid value. We won't be disallowing anything that is legal Unicode.

cheers

andrew
[HACKERS] Re: pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
Andres Freund wrote:
> It's not free. But I don't think it'd seriously matter in comparison. But it doesn't protect you if the file is created during the backup - which as you know can take a long time. For example because somebody felt the need to increase wal_keep_segments.

Can we simply back up the non-data parts of $PGDATA first, then move on to the data parts? For the files that we'd be dealing with it would be sufficiently quick to just try and fail immediately rather than check for all possible preconditions first. The main issue seems to be the case where the 2TB of data get backed up and then a small 1k file blows away all that work. Let's do those 1k files first.

David J.
Re: [HACKERS] chr() is still too loose about UTF8 code points
Heikki Linnakangas hlinnakan...@vmware.com writes:
> On 05/16/2014 06:05 PM, Tom Lane wrote:
>> I think this probably means we need to change chr() to reject code points above U+10FFFF. Should we back-patch that, or just do it in HEAD?
> +1 for back-patching. A value that cannot be restored is bad, and I can't imagine any legitimate use case for producing a Unicode character larger than U+10FFFF with chr(x), when the rest of the system doesn't handle it. Fully supporting such values might be useful, but that's a different story.

Well, AFAICT the rest of the system does handle any code point up to U+1FFFFF. It's only pg_utf8_islegal that's being picky. So another possible answer is to weaken the check in pg_utf8_islegal. However, that could create interoperability concerns with other software, and as you say the use-case for larger values seems pretty thin.

Actually, after re-reading the spec there's more to it than this: chr() will allow creating UTF8 sequences that correspond to the surrogate-pair codes, which are expressly disallowed in UTF8 by the RFCs. Maybe we should apply pg_utf8_islegal to the result string rather than duplicating its checks?

BTW, there are various places that have comments or ifdef'd-out code anticipating possible future support of 5- or 6-byte UTF8 sequences, which were specified in RFC2279 but then rescinded by RFC3629. I guess as a matter of cleanup we should think about removing that stuff.

regards, tom lane
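The RFC2279-vs-RFC3629 difference shows up directly in which lead bytes are legal. The 0xF7 byte in the COPY error earlier in the thread would begin a 4-byte sequence reaching U+1FFFFF under the old RFC 2279 rules, but RFC 3629 only permits 4-byte lead bytes up to 0xF4. A simplified lead-byte classifier, offered as a sketch (pg_utf8_islegal itself also validates the continuation bytes):

```c
/*
 * Sketch: expected UTF-8 sequence length from the lead byte, per
 * RFC 3629.  Returns 0 for bytes that can never start a legal sequence
 * (continuation bytes 0x80..0xBF, overlong leads 0xC0..0xC1, and
 * 0xF5..0xFF, which would encode code points above U+10FFFF).
 */
static int
utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;               /* ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;
    if (b >= 0xE0 && b <= 0xEF)
        return 3;
    if (b >= 0xF0 && b <= 0xF4)
        return 4;
    return 0;                   /* illegal lead byte under RFC 3629 */
}
```

This is why removing the 5- and 6-byte remnants mentioned above would be pure cleanup: no RFC 3629 lead byte can ever announce such a sequence.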
Re: [HACKERS] chr() is still too loose about UTF8 code points
On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
> I think this probably means we need to change chr() to reject code points above U+10FFFF. Should we back-patch that, or just do it in HEAD?

The compatibility risks resemble those associated with the fixes for bug #9210, so I recommend HEAD only:
http://www.postgresql.org/message-id/flat/20140220043940.ga3064...@tornado.leadboat.com

--
Noah Misch
EnterpriseDB	http://www.enterprisedb.com
Re: [HACKERS] chr() is still too loose about UTF8 code points
Noah Misch n...@leadboat.com writes:
> On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
>> I think this probably means we need to change chr() to reject code points above U+10FFFF. Should we back-patch that, or just do it in HEAD?
> The compatibility risks resemble those associated with the fixes for bug #9210, so I recommend HEAD only: http://www.postgresql.org/message-id/flat/20140220043940.ga3064...@tornado.leadboat.com

While I'd be willing to ignore that risk so far as code points above U+10FFFF go, if we want pg_utf8_islegal to be happy then we will also have to reject surrogate-pair code points. It's not beyond the realm of possibility that somebody is intentionally generating such code points with chr(), despite the dump/reload hazard. So now I agree that this is sounding more like a major-version-only behavioral change.

regards, tom lane
[HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
Hi,

elog.c's log_line_prefix() processes %d with:

    case 'd':
        if (MyProcPort)
        {
            const char *dbname = MyProcPort->database_name;

            if (dbname == NULL || *dbname == '\0')
                dbname = _("[unknown]");
            if (padding != 0)
                appendStringInfo(buf, "%*s", padding, dbname);
            else
                appendStringInfoString(buf, dbname);
        }
        else if (padding != 0)
            appendStringInfoSpaces(buf, padding > 0 ? padding : -padding);

write_csvlog() uses similar logic. Unfortunately MyProcPort only exists in user-initiated backends. It's imo pretty annoying that neither bgworkers nor autovacuum workers show the proper database in the log.

Why don't we just populate a global variable in InitPostgres() once we're sure which database the backend is connected to? We could fill fake MyProcPorts, but that doesn't seem like a good idea to me.

Greetings,

Andres Freund
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
Andres Freund and...@2ndquadrant.com writes:
> elog.c's log_line_prefix() processes %d with:
>     case 'd':
>         if (MyProcPort)
>         {
>             const char *dbname = MyProcPort->database_name;
>             if (dbname == NULL || *dbname == '\0')
>                 dbname = _("[unknown]");
>             if (padding != 0)
>                 appendStringInfo(buf, "%*s", padding, dbname);

Not directly related to your gripe, but: where did this padding logic come from, and what prevents it from creating invalidly-encoded output by means of truncating multibyte characters in the middle? Not to mention that if glibc has a different idea of the prevailing encoding than we do, it is quite likely to mess up completely. We've been burnt badly enough by use of this coding technique that it should never have been accepted in someplace as critical as elog.c.

regards, tom lane
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
On 2014-05-16 14:02:44 -0400, Tom Lane wrote:
> Not directly related to your gripe, but: where did this padding logic come from, and what prevents it from creating invalidly-encoded output by means of truncating multibyte characters in the middle?

Isn't that syntax just the *minimal* width?

Greetings,

Andres Freund
Re: [HACKERS] chr() is still too loose about UTF8 code points
Tom Lane wrote:
> Noah Misch writes:
>> On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
>>> I think this probably means we need to change chr() to reject code points above U+10FFFF. Should we back-patch that, or just do it in HEAD?
>> The compatibility risks resemble those associated with the fixes for bug #9210, so I recommend HEAD only: http://www.postgresql.org/message-id/flat/20140220043940.ga3064...@tornado.leadboat.com
> While I'd be willing to ignore that risk so far as code points above U+10FFFF go, if we want pg_utf8_islegal to be happy then we will also have to reject surrogate-pair code points. It's not beyond the realm of possibility that somebody is intentionally generating such code points with chr(), despite the dump/reload hazard. So now I agree that this is sounding more like a major-version-only behavioral change.

I would tend to agree on principle - though since this does fall in a grey area, does 9.4 qualify for this bug-fix?

David J.
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
Andres Freund and...@2ndquadrant.com writes:
> On 2014-05-16 14:02:44 -0400, Tom Lane wrote:
>> Not directly related to your gripe, but: where did this padding logic come from, and what prevents it from creating invalidly-encoded output by means of truncating multibyte characters in the middle?
> Isn't that syntax just the *minimal* width?

Ah, you're right, so sprintf shouldn't attempt to truncate the data anywhere. Nonetheless, this has created a hazard that wasn't there before: with any padding spec, sprintf has to determine the width-in-characters of the supplied string. If glibc thinks the data is invalid according to *its* idea of the prevailing encoding, it will do something we won't like. My recollection is it refuses to print anything at all.

regards, tom lane
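The %*s semantics under discussion can be demonstrated in isolation. This sketch (the wrapper `pad_demo` is invented for illustration, standing in for the appendStringInfo call) shows that a runtime width passed via * is a *minimum* field width: short strings are padded, longer strings pass through untruncated, and a negative width left-justifies.

```c
#include <stdio.h>    /* snprintf */
#include <string.h>   /* strcmp, for the checks */

/*
 * Sketch: format a database name the way log_line_prefix's
 * appendStringInfo(buf, "%*s", padding, dbname) does.  A positive
 * `padding` right-justifies to at least that many characters; a
 * negative one left-justifies; in neither case is the string cut.
 */
static void
pad_demo(char *out, size_t outlen, int padding, const char *dbname)
{
    snprintf(out, outlen, "%*s", padding, dbname);
}
```

So truncation is not the hazard here; the hazard Tom describes is that computing the field width at all makes glibc interpret the bytes in its own idea of the locale encoding.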
Re: [HACKERS] chr() is still too loose about UTF8 code points
David G Johnston david.g.johns...@gmail.com writes:
> Tom Lane wrote:
>> While I'd be willing to ignore that risk so far as code points above U+10FFFF go, if we want pg_utf8_islegal to be happy then we will also have to reject surrogate-pair code points. It's not beyond the realm of possibility that somebody is intentionally generating such code points with chr(), despite the dump/reload hazard. So now I agree that this is sounding more like a major-version-only behavioral change.
> I would tend to agree on principle - though since this does fall in a grey area, does 9.4 qualify for this bug-fix?

I don't think it's too late to change this in 9.4. The discussion was about whether to back-patch.

regards, tom lane
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
On 2014-05-16 14:51:01 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-05-16 14:02:44 -0400, Tom Lane wrote: Not directly related to your gripe, but: where did this padding logic come from, and what prevents it from creating invalidly-encoded output by means of truncating multibyte characters in the middle? Isn't that syntax just the *minimal* width? Ah, you're right, so sprintf shouldn't attempt to truncate the data anywhere. Nonetheless, this has created a hazard that wasn't there before: with any padding spec, sprintf has to determine the width-in-characters of the supplied string. Shouldn't we already have setup the correct locale at that point beside a couple of lines inside InitPostgres()? If glibc thinks the data is invalid according to *its* idea of the prevailing encoding, it will do something we won't like. My recollection is it refuses to print anything at all. Since this is just about things like database, user, application name, not the actual log message, it's probably not too bad if that happens. I.e. still debuggable. The message itself is separately appended, so it should still go through unhindered. Note that I haven't been involved in the feature and haven't thought much about related issues... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] btree_gist macaddr valgrind woes
Andres Freund and...@2ndquadrant.com writes: After Heikki has fixed the bit btree_gist bugs my valgrind run shows a bug in macaddr. AFAICT none of these traces represent live bugs, but I've pushed fixes for them into HEAD. regards, tom lane
Re: [HACKERS] btree_gist macaddr valgrind woes
On 05/16/2014 06:43 PM, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 05/16/2014 01:28 PM, Andres Freund wrote: Presumably it's because the type is a fixed width type with a length of 6. I guess it's allocating stuff with sizeof(macaddr) but then passes the result around as a Datum, which doesn't work well. I've not tried valgrinding to check, but for macaddr I suspect the blame can be laid at the feet of gbt_num_compress, which is creating a datum of actual size 2 * tinfo->size (ie, 12 bytes for macaddr). This is later stored into an index column of declared type gbtreekey16, so the tuple assembly code is expecting the datum to have size 16. Ah, I see. AFAICS there is no direct connection between the sizes the C code knows about and the sizes of the datatypes declared as STORAGE options in btree_gist--1.0.sql. Probably a reasonable fix would be to add a field to gbtree_ninfo that specifies the index datum size, and then make gbt_num_compress palloc0 that size rather than 2 * tinfo->size. ISTM the correct fix would be to define a gbtreekey12 data type and use that. That's not back-patchable though, and introducing a whole new type to save a few bytes is hardly worth it. What you did makes sense. The complaints seem to be coming from all the varlen data types, too, not just macaddr: Dunno what's the problem for the varlena types, but that's a completely separate code path. It seems to be a similar issue to what I fixed in the bit type earlier. When the lower+upper keys are stored as one varlen Datum, the padding bytes between them are not zeroed. With these datatypes (i.e. not varbit) it doesn't lead to any real errors, however, because the padding bytes are never read. Valgrind is just complaining that we're storing uninitialized data on disk, even though it's never looked at. Nevertheless, seems best to fix that, if only to silence Valgrind. (And you just beat me to it. Thanks!) 
- Heikki
Re: [HACKERS] btree_gist macaddr valgrind woes
Heikki Linnakangas hlinnakan...@vmware.com writes: On 05/16/2014 06:43 PM, Tom Lane wrote: Dunno what's the problem for the varlena types, but that's a completely separate code path. It seems to be a similar issue to what I fixed in the bit type earlier. When the lower+upper keys are stored as one varlen Datum, the padding bytes between them are not zeroed. With these datatypes (i.e. not varbit) it doesn't lead to any real errors, however, because the padding bytes are never read. Valgrind is just complaining that we're storing uninitialized data on disk, even though it's never looked at. Yeah, I came to the same conclusions. regards, tom lane
Re: [HACKERS] btree_gist macaddr valgrind woes
Heikki Linnakangas hlinnakan...@vmware.com writes: ISTM the correct fix would be to define a gbtreekey12 data type and use that. That's not back-patchable though, and introducing a whole new type to save a few bytes is hardly worth it. What you did makes sense. BTW, the *real* problem with all this stuff is that the gbtreekeyNN types are declared as having int alignment, even though some of the opclasses store double-aligned types in them. I imagine it's possible to provoke bus errors on machines that are picky about alignment. The first column of an index is safe enough because index tuples will be double-aligned anyway, but it seems like there's a hazard for lower-order columns. This is something we cannot fix compatibly :-( regards, tom lane
Re: [HACKERS] Re: pg_basebackup: could not get transaction log end position from server: FATAL: could not open file ./pg_hba.conf~: Permission denied
On 05/16/2014 08:11 PM, David G Johnston wrote: Can we simply backup the non-data parts of $PGDATA first, then move onto the data parts? For the files that we'd be dealing with it would be sufficiently quick to just try and fail immediately, rather than check for all possible preconditions first. The main issue seems to be the case where the 2TB of data get backed up and then a small 1k file blows away all that work. Let's do those 1k files first. You'll still need to distinguish data and non-data parts somehow. One idea would be to backup any files in the top directory first, before recursing into the subdirectories. That would've caught the OP's case, and probably many other typical cases where you drop something unexpected into $PGDATA. You could still have something funny nested deep in the data directory, but that's much less common. - Heikki
Re: [HACKERS] btree_gist macaddr valgrind woes
Hi, On 2014-05-16 15:31:52 -0400, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 05/16/2014 06:43 PM, Tom Lane wrote: Dunno what's the problem for the varlena types, but that's a completely separate code path. It seems to be a similar issue to what I fixed in the bit type earlier. When the lower+upper keys are stored as one varlen Datum, the padding bytes between them are not zeroed. With these datatypes (i.e. not varbit) it doesn't lead to any real errors, however, because the padding bytes are never read. Valgrind is just complaining that we're storing uninitialized data on disk, even though it's never looked at. Yeah, I came to the same conclusions. Thanks for fixing, looks good here. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
Andres Freund and...@2ndquadrant.com writes: On 2014-05-16 14:51:01 -0400, Tom Lane wrote: Ah, you're right, so sprintf shouldn't attempt to truncate the data anywhere. Nonetheless, this has created a hazard that wasn't there before: with any padding spec, sprintf has to determine the width-in-characters of the supplied string. Shouldn't we already have setup the correct locale at that point beside a couple of lines inside InitPostgres()? Define correct locale. Keep in mind that names in shared catalogs might be in any encoding. But according to previous research, we don't really guarantee that glibc thinks the encoding is what we think, anyway. The commit messages for 54cd4f04576833abc394e131288bf3dd7dcf4806 and ed437e2b27c48219a78f3504b0d05c17c2082d02 are relevant here. The second one suggests that only %.*s is at risk, not %*s, but I'm not really convinced right now. My recollection is that glibc will abandon processing either format spec if it thinks the string is wrongly encoded. Since this is just about things like database, user, application name, not the actual log message, it's probably not too bad if that happens. [ shrug... ] People *will* complain if data disappears from their logs. regards, tom lane
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
On 2014-05-16 15:44:53 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-05-16 14:51:01 -0400, Tom Lane wrote: Ah, you're right, so sprintf shouldn't attempt to truncate the data anywhere. Nonetheless, this has created a hazard that wasn't there before: with any padding spec, sprintf has to determine the width-in-characters of the supplied string. Shouldn't we already have setup the correct locale at that point beside a couple of lines inside InitPostgres()? Define correct locale. Keep in mind that names in shared catalogs might be in any encoding. Even the name of the database connected to? I've never checked, but I had assumed it to be valid in that database's encoding. But evidently not... Why doesn't name_text() have to do an encoding check btw? As there doesn't seem to be much input checking in namein, it's a pretty easy way to get bogus data in. But according to previous research, we don't really guarantee that glibc thinks the encoding is what we think, anyway. The commit messages for 54cd4f04576833abc394e131288bf3dd7dcf4806 and ed437e2b27c48219a78f3504b0d05c17c2082d02 are relevant here. The second one suggests that only %.*s is at risk, not %*s, but I'm not really convinced right now. My recollection is that glibc will abandon processing either format spec if it thinks the string is wrongly encoded. I've tested it here. From what it looks like it prints just fine but misjudges the width for padding. Since this is just about things like database, user, application name, not the actual log message, it's probably not too bad if that happens. [ shrug... ] People *will* complain if data disappears from their logs. You propose to revert? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
Andres Freund and...@2ndquadrant.com writes: On 2014-05-16 15:44:53 -0400, Tom Lane wrote: But according to previous research, we don't really guarantee that glibc thinks the encoding is what we think, anyway. The commit messages for 54cd4f04576833abc394e131288bf3dd7dcf4806 and ed437e2b27c48219a78f3504b0d05c17c2082d02 are relevant here. The second one suggests that only %.*s is at risk not %*s, but I'm not really convinced right now. My recollection is that glibc will abandon processing either format spec if it thinks the string is wrongly encoded. I've tested it here. From what it looks like it prints just fine but misjudges the width for padding. Oh, good. We can live with that. You propose to revert? Not if the data gets printed. I was afraid it wouldn't be. (In any case, we could retain the feature and just do the padding computations ourselves. But I now think we don't have to.) regards, tom lane
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
On 2014-05-16 16:19:28 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-05-16 15:44:53 -0400, Tom Lane wrote: But according to previous research, we don't really guarantee that glibc thinks the encoding is what we think, anyway. The commit messages for 54cd4f04576833abc394e131288bf3dd7dcf4806 and ed437e2b27c48219a78f3504b0d05c17c2082d02 are relevant here. The second one suggests that only %.*s is at risk not %*s, but I'm not really convinced right now. My recollection is that glibc will abandon processing either format spec if it thinks the string is wrongly encoded. I've tested it here. From what it looks like it prints just fine but misjudges the width for padding. Oh, good. We can live with that. Yea, seems pretty harmless. It's a pretty new glibc version though, so things might have been different in the past. On the other hand I can see more justification to stop processing on input than output... (In any case, we could retain the feature and just do the padding computations ourselves. But I now think we don't have to.) Might actually make the code somewhat less ugly... But I am not volunteering. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] CREATE REPLICATION SLOT fails on a timeout
I am finding that my logical walsender connections are being terminated due to a timeout on the CREATE REPLICATION SLOT command, with: terminating walsender process due to replication timeout. Below is the stack trace when this happens: #3 0x0067df28 in WalSndCheckTimeOut (now=now@entry=453585463823871) at walsender.c:1748 #4 0x0067eedc in WalSndWaitForWal (loc=691727888) at walsender.c:1216 #5 logical_read_xlog_page (state=<optimized out>, targetPagePtr=691724288, reqLen=<optimized out>, targetRecPtr=<optimized out>, cur_page=0x18476e0 "}\320\005", pageTLI=<optimized out>) at walsender.c:741 #6 0x004f41bf in ReadPageInternal (state=state@entry=0x17aa140, pageptr=pageptr@entry=691724288, reqLen=reqLen@entry=3600) at xlogreader.c:525 #7 0x004f454a in XLogReadRecord (state=0x17aa140, RecPtr=691727856, RecPtr@entry=0, errormsg=errormsg@entry=0x7fff7f668c28) at xlogreader.c:228 #8 0x00675e5c in DecodingContextFindStartpoint (ctx=ctx@entry=0x17a0358) at logical.c:460 #9 0x00680f16 in CreateReplicationSlot (cmd=0x1798b50) at walsender.c:800 #10 exec_replication_command ( cmd_string=cmd_string@entry=0x17f1478 "CREATE_REPLICATION_SLOT \"slon_4_1\" LOGICAL \"slony1_funcs.2.2.0\"") at walsender.c:1291 #11 0x006bf4a1 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x177db50, dbname=0x177db30 "test1", (gdb) p last_reply_timestamp $1 = 0 I propose the attached patch, which sets last_reply_timestamp to now() when it starts processing a command, since receiving a command is hearing something from the client. 
Steve diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c new file mode 100644 index 5c11d68..56a2f10 *** a/src/backend/replication/walsender.c --- b/src/backend/replication/walsender.c *** exec_replication_command(const char *cmd *** 1276,1281 --- 1276,1282 parse_rc; cmd_node = replication_parse_result; + last_reply_timestamp = GetCurrentTimestamp(); switch (cmd_node->type) {
Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout
Hi, On 2014-05-16 16:37:16 -0400, Steve Singer wrote: I am finding that my logical walsender connections are being terminated due to a timeout on the CREATE REPLICATION SLOT command, with: terminating walsender process due to replication timeout. Below is the stack trace when this happens #3 0x0067df28 in WalSndCheckTimeOut (now=now@entry=453585463823871) at walsender.c:1748 #4 0x0067eedc in WalSndWaitForWal (loc=691727888) at walsender.c:1216 ... #9 0x00680f16 in CreateReplicationSlot (cmd=0x1798b50) at walsender.c:800 #10 exec_replication_command () at walsender.c:1291 #11 0x006bf4a1 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x177db50, dbname=0x177db30 "test1", (gdb) p last_reply_timestamp $1 = 0 I propose the attached patch, which sets last_reply_timestamp to now() when it starts processing a command, since receiving a command is hearing something from the client. Hm. Yes, that's a problem. diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c new file mode 100644 index 5c11d68..56a2f10 *** a/src/backend/replication/walsender.c --- b/src/backend/replication/walsender.c *** exec_replication_command(const char *cmd *** 1276,1281 --- 1276,1282 parse_rc; cmd_node = replication_parse_result; + last_reply_timestamp = GetCurrentTimestamp(); switch (cmd_node->type) { I don't think that's going to cut it though. The creation can take longer than whatever wal_sender_timeout is set to (when there's lots of longrunning transactions). I think checking whether last_reply_timestamp = 0 during timeout checking is more robust. Greetings, Andres Freund
Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout
On 05/16/2014 04:43 PM, Andres Freund wrote: Hi, I don't think that's going to cut it though. The creation can take longer than whatever wal_sender_timeout is set to (when there's lots of longrunning transactions). I think checking whether last_reply_timestamp = 0 during timeout checking is more robust. Greetings, Andres Freund That makes sense. A patch that does that is attached. Steve diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c new file mode 100644 index 5c11d68..d23f06b *** a/src/backend/replication/walsender.c --- b/src/backend/replication/walsender.c *** WalSndCheckTimeOut(TimestampTz now) *** 1738,1744 timeout = TimestampTzPlusMilliseconds(last_reply_timestamp, wal_sender_timeout); ! if (wal_sender_timeout > 0 && now >= timeout) { /* * Since typically expiration of replication timeout means --- 1738,1744 timeout = TimestampTzPlusMilliseconds(last_reply_timestamp, wal_sender_timeout); ! if (last_reply_timestamp > 0 && wal_sender_timeout > 0 && now >= timeout) { /* * Since typically expiration of replication timeout means
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
On 16 May 2014 13:21, Craig Ringer cr...@2ndquadrant.com wrote: On 05/16/2014 04:46 PM, Craig Ringer wrote: I'll follow up with the patch and a git tree when it's ready, hopefully tonight. Here's a rebased version of Simon's original patch that runs on current master. I still need to get the isolation tests for it merged and sorted out, and after re-reading it I'd like to change waitMode into an enum, not just some #defines. Hope it's useful for comparison and ideas. Thank you! At first glance they're sort of similar, which is reassuring. I'm especially interested in the buffer release semantics, which I was confused about, and will look into that (to resolve the TODO notes in my patch). I noticed that in applyLockingClause, Simon has rc->waitMode |= waitMode. Is that right? The values are 0, 1, and 2, but if you had both NOWAIT and SKIP LOCKED somewhere in your query you could end up with rc->waitMode == 3, unless I'm mistaken. In my patch I have code that gives precedence to NOWAIT, though looking at it again it could be simpler. (One reviewer pointed out that it should really be a single unified enum. In fact I have been a bit unsure about what scope such an enumeration should have in the application -- could it even be used in parser code? I tried to follow existing examples, which is why I used #define macros in gram.y.) From a bikeshed-colour point of view: * I used SKIP LOCKED DATA like DB2, and Simon used SKIP LOCKED like Oracle, and I guess shorter is sweeter * I used the term wait_policy and an enum, Simon used waitMode and an int * I had noWait and skipLocked travelling separately in some places, Simon had a single parameter, which is much better Best regards, Thomas Munro
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
Hi, On 2014-05-16 19:54:56 +0200, Andres Freund wrote: Hi, elog.c's log_line_prefix() processes %d with: case 'd': if (MyProcPort) { const char *dbname = MyProcPort->database_name; if (dbname == NULL || *dbname == '\0') dbname = _("[unknown]"); if (padding != 0) appendStringInfo(buf, "%*s", padding, dbname); else appendStringInfoString(buf, dbname); } else if (padding != 0) appendStringInfoSpaces(buf, padding > 0 ? padding : -padding); write_csvlog() uses similar logic. Unfortunately MyProcPort only exists in user initiated backends. It's imo pretty annoying that neither bgworkers nor autovacuum workers show the proper database in the log. Why don't we just populate a global variable in InitPostgres() once we're sure which database the backend is connected to? We could fill fake MyProcPorts, but that doesn't seem like a good idea to me. The attached simple patch implements the former idea. We could probably replace a couple of get_database_name(MyDatabaseId) calls by it, but that doesn't look that important. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From e0b5bae6323f22494ba8aad79a50a8fd69a0d21c Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Fri, 16 May 2014 22:34:47 +0200 Subject: [PATCH] Support showing the database in log messages originating in a background process. --- src/backend/utils/error/elog.c | 11 +-- src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 3 +++ src/include/miscadmin.h | 1 + 4 files changed, 14 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 0d92dcd..db6cc43 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -2320,9 +2320,14 @@ log_line_prefix(StringInfo buf, ErrorData *edata) padding > 0 ? 
padding : -padding); break; case 'd': -if (MyProcPort) +if (MyProcPort || MyDatabaseName) { - const char *dbname = MyProcPort->database_name; + const char *dbname; + + if (MyProcPort) + dbname = MyProcPort->database_name; + else + dbname = MyDatabaseName; if (dbname == NULL || *dbname == '\0') dbname = _("[unknown]"); @@ -2577,6 +2582,8 @@ write_csvlog(ErrorData *edata) /* database name */ if (MyProcPort) appendCSVLiteral(buf, MyProcPort->database_name); + else if (MyDatabaseName) + appendCSVLiteral(buf, MyDatabaseName); appendStringInfoChar(buf, ','); /* Process id */ diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index be74835..46d4cf4 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -62,6 +62,7 @@ char postgres_exec_path[MAXPGPATH]; /* full path to backend */ BackendId MyBackendId = InvalidBackendId; Oid MyDatabaseId = InvalidOid; +char *MyDatabaseName = NULL; Oid MyDatabaseTableSpace = InvalidOid; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index ed936d7..0f5aaa8 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -51,6 +51,7 @@ #include "utils/acl.h" #include "utils/fmgroids.h" #include "utils/guc.h" +#include "utils/memutils.h" #include "utils/pg_locale.h" #include "utils/portal.h" #include "utils/ps_status.h" @@ -795,6 +796,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, MyDatabaseTableSpace = dbform->dattablespace; /* take database name from the caller, just for paranoia */ strlcpy(dbname, in_dbname, sizeof(dbname)); + MyDatabaseName = MemoryContextStrdup(TopMemoryContext, in_dbname); } else { @@ -815,6 +817,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, /* pass the database name back to the caller */ if (out_dbname) strcpy(out_dbname, dbname); + MyDatabaseName = MemoryContextStrdup(TopMemoryContext, dbname); } /* Now we can mark our PGPROC entry with the database ID */ diff 
--git a/src/include/miscadmin.h b/src/include/miscadmin.h index c2b786e..91710d3 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -165,6 +165,7 @@ extern char postgres_exec_path[]; * extern BackendId MyBackendId; */ extern PGDLLIMPORT Oid MyDatabaseId; +extern PGDLLIMPORT char *MyDatabaseName; extern PGDLLIMPORT Oid MyDatabaseTableSpace; -- 2.0.0.rc2.4.g1dc51c6.dirty
[HACKERS] 9.4 checksum error in recovery with btree index
More fun with my torn page injection test program on 9.4. 24171 2014-05-16 14:00:44.934 PDT:WARNING: 01000: page verification failed, calculated checksum 21100 but expected 3356 24171 2014-05-16 14:00:44.934 PDT:CONTEXT: xlog redo split_l: rel 1663/16384/16405 left 35191, right 35652, next 34666, level 0, firstright 192 24171 2014-05-16 14:00:44.934 PDT:LOCATION: PageIsVerified, bufpage.c:145 24171 2014-05-16 14:00:44.934 PDT:FATAL: XX001: invalid page in block 34666 of relation base/16384/16405 24171 2014-05-16 14:00:44.934 PDT:CONTEXT: xlog redo split_l: rel 1663/16384/16405 left 35191, right 35652, next 34666, level 0, firstright 192 24171 2014-05-16 14:00:44.934 PDT:LOCATION: ReadBuffer_common, bufmgr.c:483 I've seen this twice now, the checksum failure was both times for the block labelled next in the redo record. Is this another case where the block needs to be reinitialized upon replay? Cheers, Jeff
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
Andres Freund and...@2ndquadrant.com writes: It's imo pretty annoying that neither bgworkers nor autovacuum workers show the proper database in the log. Why don't we just populate a global variable in InitPostgres() once we're sure which database the backend is connected to? We could fill fake MyProcPorts, but that doesn't seem like a good idea to me. The attached simple patch implements the former idea. This is basically reintroducing a variable that used to exist and was intentionally gotten rid of, on the grounds that it'd fail to track any renaming of the session's current database (cf commit b256f24264). I'm not sure how much effort ought to be put into dealing with that corner case; but let's not reintroduce old bugs just for failure to remember them. The existing implementation of log line %d has the same issue of course, so this is not a very good argument not to change %d. It *is* a reason not to blindly change get_database_name(MyDatabaseId) calls, which were coded that way precisely for this reason. It might also be a reason not to blithely expose a global variable like this, which would encourage making of the same mistake in places that might be more critical than %d. regards, tom lane
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
On 2014-05-16 17:48:28 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: It's imo pretty annoying that neither bgworkers nor autovacuum workers show the proper database in the log. Why don't we just populate a global variable in InitPostgres() once we're sure which database the backend is connected to? We could fill fake MyProcPorts, but that doesn't seem like a good idea to me. The attached simple patch implements the former idea. This is basically reintroducing a variable that used to exist and was intentionally gotten rid of, on the grounds that it'd fail to track any renaming of the session's current database (cf commit b256f24264). I'm not sure how much effort ought to be put into dealing with that corner case; but let's not reintroduce old bugs just for failure to remember them. I did look at whether there's a race making MyDatabaseName out of date. I didn't see any. There's a window where it could be renamed between where I'd done the MyDatabaseName = ... and the LockSharedObject() a few lines below. But directly afterwards there's a check that the database name is still correct. We could move the filling of MyDatabaseName to lessen the amount of comments that need to be added. It looks like the interlock around database names didn't exist back in b256f24264 - so the situation simply has changed. Looks like that happened in 2667d56a3b489e5645f0. The existing implementation of log line %d has the same issue of course, so this is not a very good argument not to change %d. It *is* a reason not to blindly change get_database_name(MyDatabaseId) calls, which were coded that way precisely for this reason. It might also be a reason not to blithely expose a global variable like this, which would encourage making of the same mistake in places that might be more critical than %d. I think exposing the variable is somewhat reasonable. It allows us to print the database name in situations where we'd normally not dare because of the syscache lookup... 
And given that it doesn't look racy anymore, the danger seems pretty low. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] %d in log_line_prefix doesn't work for bg/autovacuum workers
On 2014-05-17 00:17:46 +0200, Andres Freund wrote: On 2014-05-16 17:48:28 -0400, Tom Lane wrote: This is basically reintroducing a variable that used to exist and was intentionally gotten rid of, on the grounds that it'd fail to track any renaming of the session's current database (cf commit b256f24264). I'm not sure how much effort ought to be put into dealing with that corner case; but let's not reintroduce old bugs just for failure to remember them. I did look at whether there's a race making MyDatabaseName out of date. I didn't see any. There's a window where it could be renamed between where I'd done the MyDatabaseName = ... and the LockSharedObject() a few lines below. But directly afterwards there's a check that the database name is still correct. We could move the filling of MyDatabaseName to lessen the amount of comments that need to be added. Moving it also provides the name in bootstrap mode. No idea whether there'll ever be a use in that, but given the variable may later replace some get_database_name(MyDatabaseId) calls it seems to be a benefit. I'd briefly changed elog.c to only use MyDatabaseName, thinking that there seems to be little reason to continue using database_name in elog.c once the variable is introduced. But that's not a good idea: MyProcPort->database_name exists earlier. It's imo a bit debatable whether it's a good idea to log database names in log_line_prefix that don't exist. But that's a separate change. I've changed it though so that MyDatabaseName is preferred over MyProc. It's imo easier to see that it has the correct value. And it'll be correct if we ever support mapping authentication and real database names or something. Revised patch attached. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

From 3dd4063e607b0800fe74d33188c3198961ada073 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Fri, 16 May 2014 22:34:47 +0200
Subject: [PATCH] Support showing the database in log entries issued by background processes.

Until now %d in log_line_prefix and the database_name column in csvlog didn't display the database for autovacuum and background worker processes. Because logging had better not perform catalog accesses, MyProcPort->database_name was used, i.e. the database name specified by the user when connecting. As that structure isn't used in background processes that aren't connected to clients, such log messages were using an empty/NULL database name respectively.

Add a new global variable MyDatabaseName that contains the name of the database a process is connected to. Such a variable even used to exist, back before b256f24264, but it was removed due to concerns about it being out of date because of database renames. Since then, however, 2667d56a3b added locking that prevents a database with connected users from being renamed.
---
 src/backend/utils/error/elog.c    | 13 ++---
 src/backend/utils/init/globals.c  |  1 +
 src/backend/utils/init/postinit.c | 11 +++
 src/include/miscadmin.h           |  1 +
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 0d92dcd..41b93f3 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -2320,9 +2320,14 @@ log_line_prefix(StringInfo buf, ErrorData *edata)
 					padding > 0 ?
					padding : -padding);
 			break;
 		case 'd':
-			if (MyProcPort)
+			if (MyProcPort || MyDatabaseName)
 			{
-				const char *dbname = MyProcPort->database_name;
+				const char *dbname;
+
+				if (MyDatabaseName)
+					dbname = MyDatabaseName;
+				else
+					dbname = MyProcPort->database_name;
 
 				if (dbname == NULL || *dbname == '\0')
 					dbname = _("[unknown]");
@@ -2575,7 +2580,9 @@ write_csvlog(ErrorData *edata)
 	appendStringInfoChar(buf, ',');
 
 	/* database name */
-	if (MyProcPort)
+	if (MyDatabaseName)
+		appendCSVLiteral(buf, MyDatabaseName);
+	else if (MyProcPort)
 		appendCSVLiteral(buf, MyProcPort->database_name);
 
 	appendStringInfoChar(buf, ',');
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index be74835..46d4cf4 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -62,6 +62,7 @@ char postgres_exec_path[MAXPGPATH]; /* full path to backend */
 BackendId	MyBackendId = InvalidBackendId;
 
 Oid			MyDatabaseId = InvalidOid;
+char	   *MyDatabaseName = NULL;
 
 Oid			MyDatabaseTableSpace = InvalidOid;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ed936d7..ca11ae4 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -51,6 +51,7 @@
 #include "utils/acl.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/memutils.h"
 #include "utils/pg_locale.h"
 #include "utils/portal.h"
 #include "utils/ps_status.h"
@@ -862,6
Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout
On 2014-05-16 17:02:33 -0400, Steve Singer wrote: I don't think that's going to cut it though. The creation can take longer than whatever wal_sender_timeout is set to (when there are lots of long-running transactions). I think checking whether last_reply_timestamp == 0 during timeout checking is more robust. That makes sense. A patch that does that is attached. I've attached a heavily revised version of that patch. The original didn't touch all the necessary places... Survives creation of logical replication slots under 'intensive pressure', with a wal_sender_timeout of 10ms. Thanks for the bug report! Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

From 324d74e16dd8ee2a0fa977d92a269fd43a746196 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sat, 17 May 2014 01:22:01 +0200
Subject: [PATCH] Don't pay heed to wal_sender_timeout while creating a decoding slot.

Sometimes CREATE_REPLICATION_SLOT ... LOGICAL ... needs to wait for further WAL using WalSndWaitForWal(). That used to always respect wal_sender_timeout and kill the session when waiting long enough, because no feedback/ping messages can be sent while the slot is still being created.

Add the notion that last_reply_timestamp = 0 means that the walsender currently doesn't need timeout processing, and add checks for it in a couple of places.

Bug report and initial patch by Steve Singer, revised by Andres Freund.
---
 src/backend/replication/walsender.c | 25 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5c11d68..90394ce 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -148,9 +148,10 @@ static StringInfoData reply_message;
 static StringInfoData tmpbuf;
 
 /*
- * Timestamp of the last receipt of the reply from the standby.
+ * Timestamp of the last receipt of the reply from the standby.
Set to 0 if
+ * the process currently shouldn't be killed by wal_sender_timeout.
  */
-static TimestampTz last_reply_timestamp;
+static TimestampTz last_reply_timestamp = 0;
 
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
@@ -796,6 +797,14 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 										logical_read_xlog_page,
 										WalSndPrepareWrite, WalSndWriteData);
 
+		/*
+		 * Signal that we don't need the timeout mechanism. We're just
+		 * creating the replication slot and don't yet accept feedback
+		 * messages or send keepalives. As we possibly need to wait for
+		 * further WAL the walsender would possibly be killed too soon.
+		 */
+		last_reply_timestamp = 0;
+
 		/* build initial snapshot, might take a while */
 		DecodingContextFindStartpoint(ctx);
 
@@ -1693,7 +1702,7 @@ WalSndComputeSleeptime(TimestampTz now)
 {
 	long		sleeptime = 10000;	/* 10 s */
 
-	if (wal_sender_timeout > 0)
+	if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
 	{
 		TimestampTz wakeup_time;
 		long		sec_to_timeout;
@@ -1735,6 +1744,10 @@ WalSndCheckTimeOut(TimestampTz now)
 {
 	TimestampTz timeout;
 
+	/* don't bail out if we're doing something that doesn't require timeouts */
+	if (last_reply_timestamp <= 0)
+		return;
+
 	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
 										  wal_sender_timeout);
@@ -2879,7 +2892,11 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 {
 	TimestampTz ping_time;
 
-	if (wal_sender_timeout <= 0)
+	/*
+	 * Don't send keepalive messages if timeouts are globally disabled or
+	 * we're doing something not partaking in timeouts.
+	 */
+	if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
 		return;
 
 	if (waiting_for_ping_response)
-- 
2.0.0.rc2.4.g1dc51c6.dirty
Re: [HACKERS] Scaling shared buffer eviction
On Fri, May 16, 2014 at 10:51 AM, Amit Kapila amit.kapil...@gmail.com wrote:

             Thrds (64)   Thrds (128)
HEAD             45562        17128
HEAD + 64        57904        32810
V1 + 64         105557        81011
HEAD + 128       58383        32997
V1 + 128        110705       114544

I haven't actually reviewed the code, but this sort of thing seems like good evidence that we need your patch, or something like it. The fact that the patch produces little performance improvement on its own (though it does produce some) shouldn't be held against it - the fact that the contention shifts elsewhere when the first bottleneck is removed is not your patch's fault. In terms of ameliorating contention on the buffer mapping locks, I think it would be better to replace the whole buffer mapping table with something different. I started working on that almost 2 years ago, building a hash table that can be read without requiring any locks and written with, well, less locking than what we have right now: http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash I never got quite as far as trying to hook that up to the buffer mapping machinery, but maybe that would be worth doing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
On Fri, May 16, 2014 at 7:51 AM, Amit Kapila amit.kapil...@gmail.com wrote:

shared_buffers = 8GB, scale factor = 3000, RAM = 64GB

             Thrds (64)   Thrds (128)
HEAD             45562        17128
HEAD + 64        57904        32810
V1 + 64         105557        81011
HEAD + 128       58383        32997
V1 + 128        110705       114544

shared_buffers = 8GB, scale factor = 1000, RAM = 64GB

             Thrds (64)   Thrds (128)
HEAD             92142        31050
HEAD + 64       108120        86367
V1 + 64         117454       123429
HEAD + 128      107762        86902
V1 + 128        123641       124822

I'm having a little trouble following this. Are these figures transactions per second for a 300 second pgbench tpc-b run? What does Thrds denote? -- Peter Geoghegan
Re: [HACKERS] Scaling shared buffer eviction
On Sat, May 17, 2014 at 6:29 AM, Peter Geoghegan p...@heroku.com wrote: On Fri, May 16, 2014 at 7:51 AM, Amit Kapila amit.kapil...@gmail.com wrote: shared_buffers = 8GB, scale factor = 3000, RAM = 64GB I'm having a little trouble following this. Are these figures transactions per second for a 300 second pgbench tpc-b run? Yes, the figures are tps for a 300 second run. It is for select-only transactions. What does Thrds denote? It denotes the number of threads (-j in the pgbench run). I have used the statements below to take the data: ./pgbench -c 64 -j 64 -T 300 -S postgres ./pgbench -c 128 -j 128 -T 300 -S postgres The reason for posting the numbers for 64/128 threads is that the concurrency bottleneck mainly appears when the number of connections is higher than the number of CPU cores, and I am using a 16-core, 64-hardware-thread machine. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On Sat, May 17, 2014 at 6:02 AM, Robert Haas robertmh...@gmail.com wrote: I haven't actually reviewed the code, but this sort of thing seems like good evidence that we need your patch, or something like it. The fact that the patch produces little performance improvement on its own (though it does produce some) shouldn't be held against it - the fact that the contention shifts elsewhere when the first bottleneck is removed is not your patch's fault. In terms of ameliorating contention on the buffer mapping locks, I think it would be better to replace the whole buffer mapping table with something different. Is there anything bad, except perhaps the increase in the number of LWLocks, about scaling the number of hash partitions with shared buffers, either by auto-tuning or via a configuration knob? I understand that it would be a bit difficult for users to estimate the correct value of such a parameter; we could provide information about its usage in the docs, such that if a user increases shared_buffers by some factor 'X' (say 20 times) over the default value (128MB), they should consider increasing such partitions, always to a power of 2 - or do something similar internally in the code. I agree that even with a reasonably good estimate of the number of partitions relative to shared buffers we might not be able to eliminate the contention around BufMappingLocks, but I think the scalability we get by doing that is not bad either. I started working on that almost 2 years ago, building a hash-table that can be read without requiring any locks and written with, well, less locking than what we have right now: I have still not read the complete code, but just going through the initial file header, it seems to me that it will be much better than the current implementation in terms of concurrency. By the way, could such an implementation also be extended to improve the scalability of hash indexes? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
On 05/17/2014 05:24 AM, Thomas Munro wrote: I noticed that in applyLockingClause, Simon has rc->waitMode |= waitMode. Is that right? The values are 0, 1, and 2, but if you had both NOWAIT and SKIP LOCKED somewhere in your query you could end up with rc->waitMode == 3 unless I'm mistaken. I do not think that |= is correct there. It may be that no case can arise where you get the bogus value, but since in all other places the values are tested for equality, not as bit fields (waitMode == NOWAIT, not waitMode & NOWAIT), it doesn't make sense to |= it there. In my patch I have code that would give precedence to NOWAIT, though looking at it again it could be simpler. I agree; if NOWAIT is present anywhere it should be preferred to SKIP LOCKED, as it's OK to apply NOWAIT where SKIP LOCKED appears, but possibly semantically incorrect to apply SKIP LOCKED where only NOWAIT was expected. (One reviewer pointed out that it should really be a single unified enum. In fact I have been a bit unsure about what scope such an enumeration should have in the application -- could it even be used in parser code? I tried to follow existing examples, which is why I used #define macros in gram.y). Not sure there. From a bikeshed colour point of view: * I used SKIP LOCKED DATA like DB2, and Simon used SKIP LOCKED like Oracle, and I guess shorter is sweeter We have a long tradition of trying to allow noise keywords where it's harmless. So the clause should probably be SKIP LOCKED [DATA], in much the same way we have BEGIN [ WORK | TRANSACTION ]. There won't be any ambiguity there. * I used the term wait_policy and an enum, Simon used waitMode and an int I prefer an enum and intended to change Simon's patch but didn't have the time. I have some isolation tester and regression tests that are still to follow. * I had noWait and skipLocked travelling separately in some places, Simon had a single parameter, which is much better Yes, I strongly prefer that.
-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services