Re: [HACKERS] Minmax indexes
On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
> On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
>> [minmax-5.patch]
>
> I have the impression it's not quite working correctly. The attached
> program returns different results for different values of
> enable_bitmapscan (consistently).
>
> (Btw, I had to raise max_locks_per_transaction even for not-so-large
> tables -- is that expected? For a 100M-row table,
> max_locks_per_transaction=1024 was not enough; I set it to 2048. Might
> be worth some documentation, eventually.)
>
> From eyeballing the results it looks like the minmax result (i.e. the
> result set with enable_bitmapscan = 1) yields only the last part,
> because only the 'last' rows seem to be present (see the values in
> column i in table tmm in the attached program).

Looking back at that, I realize I should have added a bit more detail on that test.sh program and its output (attached to the previous mail).

test.sh creates a table tmm and a minmax index on that table:

testdb=# \d tmm
     Table "public.tmm"
 Column |  Type   | Modifiers
--------+---------+-----------
 i      | integer |
 r      | integer |
Indexes:
    "tmm_minmax_idx" minmax (r)

The following shows the problem: the same search with the minmax index on versus off gives different result sets:

testdb=# set enable_bitmapscan=0; select count(*) from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.473 ms
 count
-------
  1261
(1 row)

Time: 7.764 ms
testdb=# set enable_bitmapscan=1; select count(*) from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.471 ms
 count
-------
     3
(1 row)

Time: 1.014 ms
testdb=# set enable_bitmapscan=1; select * from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.615 ms
  i   |     r
------+-----------
 9945 |  45405603
 9951 | 102552485
 9966 |  63763962
(3 rows)

Time: 0.984 ms
testdb=# set enable_bitmapscan=0; select * from (select * from tmm where r between symmetric 19494484 and 145288238 order by i desc limit 10) f order by i;
SET
Time: 0.470 ms
  i   |     r
------+-----------
 9852 | 114996906
 9858 |  69907169
 9875 |  43341583
 9894 | 127862657
 9895 |  44740033
 9911 |  51797553
 9916 |  58538774
 9945 |  45405603
 9951 | 102552485
 9966 |  63763962
(10 rows)

Time: 8.704 ms
testdb=#

If enable_bitmapscan=1 (i.e. using the minmax index), only some values are retrieved (in this case 3 rows). It turns out those are always the last N rows of the full result set (i.e. the one obtained with enable_bitmapscan=0).

Erikjan Rijkers

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
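For readers unfamiliar with the minmax design being tested above, the pruning idea can be sketched in a few lines of Python. This is a simplified model -- per-range (min, max) summaries over consecutive rows -- with invented names, not the patch's actual C code:

```python
# Toy model of a minmax index: each range of rows is summarized by
# (min, max) of the indexed column; a scan visits a range only when
# the query interval overlaps [min, max], then rechecks each row.

def build_summaries(values, range_size):
    """Summarize consecutive ranges of rows as (min, max) pairs."""
    summaries = []
    for start in range(0, len(values), range_size):
        chunk = values[start:start + range_size]
        summaries.append((min(chunk), max(chunk)))
    return summaries

def minmax_scan(values, summaries, range_size, lo, hi):
    """Return values v with lo <= v <= hi, pruning whole ranges
    whose summary cannot overlap the query interval."""
    result = []
    for i, (rmin, rmax) in enumerate(summaries):
        if rmax < lo or rmin > hi:
            continue  # whole range pruned, no rows visited
        for v in values[i * range_size:(i + 1) * range_size]:
            if lo <= v <= hi:
                result.append(v)
    return result

values = [5, 40, 12, 99, 3, 77, 41, 8]
summ = build_summaries(values, 4)
# A correct minmax scan must always agree with a sequential filter:
assert minmax_scan(values, summ, 4, 10, 80) == [v for v in values if 10 <= v <= 80]
```

A correct implementation must return exactly the same rows as a plain sequential filter -- which is precisely the property the report above shows being violated when enable_bitmapscan=1.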
Re: [HACKERS] [PATCH] bitmap indexes
On 09/26/2013 12:20 AM, Alvaro Herrera wrote:
> This led me to research how these indexes are stored. I note that what
> we're doing here is to create another regular table and a btree index
> on top of it, and those are the ones that actually store the index
> data. This seems grotty and completely unlike the way we do things
> elsewhere (compare GIN indexes, which have rbtrees inside them).

Perhaps you meant that GIN has a B-tree inside. An RB-tree is in fact used by GiST, but only as an in-memory structure during a search, to get the tuples sorted by distance.

// Antonin Houska (Tony)
Re: [HACKERS] pgbench progress report improvements - split 3 v2
> My feelings on the patch split haven't changed; your three bullet
> points call for four separate patches. Conflicting patches are bad,
> but dependent patches are okay;

Indeed, this is the only solution if you do not want a single patch. Note that it will not be possible to backtrack any one of the patches but the last without conflicts.

> just disclose the dependency order. How about this: as a next step,
> please extract just this feature that I listed last Saturday:
>
> Patch (4): Redefine "latency" as reported by pgbench and report lag more.
>
> Once that's committed, we can move on to others.

Ok, I'll submit a first part, hopefully today, possibly the one you suggest, about fixing and extending the latency measure under --rate and reporting it under --progress.

-- Fabien.
[HACKERS] pgbench filler columns
While looking at the compressibility of WAL files generated by pgbench, which is close to 90%, I first thought it's because of the filler column in the accounts table. But a comment in pgbench.c says:

/*
 * Note: TPC-B requires at least 100 bytes per row, and the "filler"
 * fields in these table declarations were intended to comply with that.
 * But because they default to NULLs, they don't actually take any space.
 * We could fix that by giving them non-null default values. However, that
 * would completely break comparability of pgbench results with prior
 * versions. Since pgbench has never pretended to be fully TPC-B
 * compliant anyway, we stick with the historical behavior.
 */

The comment about them being NULL and hence not taking up any space was added by commit b7a67c2840f193f in response to this bug report:
http://www.postgresql.org/message-id/200710170810.l9h8a76i080...@wwwmaster.postgresql.org

But I find it otherwise. On my machine, the accounts table can only fit 62 tuples in a page with the default fillfactor. The following queries show that the filler column is NOT NULL, but set to an empty string. I have tested on 8.2, 8.4 and master, and they all show the same behavior. So I don't know if that bug report itself was wrong or if I am reading the comment wrong.

postgres=# select count(*) from pgbench_accounts where filler IS NULL;
 count
-------
     0
(1 row)

postgres=# select count(*) from pgbench_accounts where filler = '';
 count
-------
    10
(1 row)

Thanks, Pavan
-- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
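The "62 tuples in a page" observation can be roughly sanity-checked with back-of-envelope arithmetic. All constants below are assumptions, not values read from the server: an 8 kB page with a 24-byte page header, a 4-byte line pointer per tuple, a 23-byte tuple header padded to 24, three 4-byte ints (aid, bid, abalance), and a char(84) filler stored blank-padded with a 1-byte short varlena header, the whole tuple rounded up to 8-byte alignment:

```python
# Rough model of heap page capacity for pgbench_accounts rows.
# These constants are assumptions for illustration, not exact internals.

PAGE_SIZE = 8192
PAGE_HEADER = 24
LINE_POINTER = 4
TUPLE_HEADER = 24          # 23-byte header, padded to MAXALIGN
FIXED_COLS = 3 * 4         # aid, bid, abalance (int4)
FILLER = 1 + 84            # short varlena header + blank-padded char(84)

def maxalign(n, align=8):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align

tuple_size = maxalign(TUPLE_HEADER + FIXED_COLS + FILLER)
per_page = (PAGE_SIZE - PAGE_HEADER) // (tuple_size + LINE_POINTER)
print(tuple_size, per_page)
```

This crude model gives 61 tuples per page -- the same ballpark as the observed 62; the exact figure depends on header and alignment details the model glosses over. Either way it confirms the point of the mail: an empty, blank-padded char(84) filler takes real space, unlike a NULL.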
Re: [HACKERS] [PATCH] bitmap indexes
At 2013-09-25 19:20:17 -0300, alvhe...@2ndquadrant.com wrote:
> Here are some quick items while skimming this patch.

Great, that gives me plenty to work on.

At this point, I think it's appropriate to mark this patch as returned with feedback (which I will do), since the changes needed seem fairly major. I'll submit a revised patch for the next commitfest.

Many thanks to everyone who tried the patch and commented.

-- Abhijit
Re: [HACKERS] SSI freezing bug
On 23.09.2013 01:07, Hannu Krosing wrote:
> On 09/20/2013 12:55 PM, Heikki Linnakangas wrote:
>> Hi,
>>
>> Prompted by Andres Freund's comments on my Freezing without Write I/O
>> patch, I realized that there's an existing bug in the way predicate
>> locking handles freezing (or rather, it doesn't handle it).
>>
>> When a tuple is predicate-locked, the key of the lock is ctid+xmin.
>> However, when a tuple is frozen, its xmin is changed to FrozenXid.
>> That effectively invalidates any predicate lock on the tuple, as
>> checking for a lock on the same tuple later won't find it, since the
>> xmin is different.
>>
>> Attached is an isolationtester spec to demonstrate this.
>
> The case is even fishier than that. That is, you can get bad behaviour
> on at least v9.2.4 even without VACUUM FREEZE. You just need to run
> permutation "r1" "r2" "w1" "w2" "c1" "c2" twice in a row. The first
> time it does get a serialization error at c2, but the 2nd time both c1
> and c2 complete successfully.

Oh, interesting. I did some debugging on this: there are actually *two* bugs, either one of which alone is enough to cause this:

1. In heap_hot_search_buffer(), the PredicateLockTuple() call is passed the wrong offset number. heapTuple->t_self is set to the tid of the first tuple in the chain that's visited, not the one actually being read.

2. CheckForSerializableConflictIn() uses the tuple's t_ctid field instead of t_self to check for existing predicate locks on the tuple. If the tuple was updated, but the updater rolled back, t_ctid points to the aborted dead tuple.

After fixing both of those bugs, running the test case twice in a row works, ie. causes a conflict and a rollback both times. Anyone see a problem with this?

That still leaves the original problem I spotted, with freezing; that's yet another unrelated bug.
- Heikki

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ead3d69..a21f31b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1688,6 +1688,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	at_chain_start = first_call;
 	skip = !first_call;
 
+	heapTuple->t_self = *tid;
+
 	/* Scan through possible multiple members of HOT-chain */
 	for (;;)
 	{
@@ -1717,7 +1719,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		heapTuple->t_data = (HeapTupleHeader) PageGetItem(dp, lp);
 		heapTuple->t_len = ItemIdGetLength(lp);
 		heapTuple->t_tableOid = RelationGetRelid(relation);
-		heapTuple->t_self = *tid;
+		ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
 
 		/*
 		 * Shouldn't see a HEAP_ONLY tuple at chain start.

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d656d62..50583b3 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -4282,8 +4282,8 @@ CheckForSerializableConflictIn(Relation relation, HeapTuple tuple,
 		SET_PREDICATELOCKTARGETTAG_TUPLE(targettag,
 							relation->rd_node.dbNode,
 							relation->rd_id,
-							ItemPointerGetBlockNumber(&(tuple->t_data->t_ctid)),
-							ItemPointerGetOffsetNumber(&(tuple->t_data->t_ctid)),
+							ItemPointerGetBlockNumber(&(tuple->t_self)),
+							ItemPointerGetOffsetNumber(&(tuple->t_self)),
 							HeapTupleHeaderGetXmin(tuple->t_data));
 		CheckTargetForConflictsIn(&targettag);
 	}
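The keying problem Heikki describes -- and the freezing variant that started the thread -- can be illustrated with a toy model. This is hypothetical Python with invented names, not the real predicate.c structures: locks are keyed on (block, offset, xmin), so rewriting xmin to a FrozenXid sentinel makes a later conflict check compute a different key and silently miss the lock:

```python
# Toy model: predicate locks keyed by (block, offset, xmin).
# Freezing rewrites a tuple's xmin, so the same physical tuple
# no longer matches the key the lock was stored under.

FROZEN_XID = 2  # stand-in sentinel, like PostgreSQL's FrozenTransactionId

class PredicateLockTable:
    def __init__(self):
        self.locks = set()

    def lock_tuple(self, block, offset, xmin):
        self.locks.add((block, offset, xmin))

    def has_conflict(self, block, offset, xmin):
        return (block, offset, xmin) in self.locks

table = PredicateLockTable()
table.lock_tuple(block=1, offset=5, xmin=1000)

# Before freezing: the conflict is detected.
assert table.has_conflict(1, 5, 1000)

# After VACUUM FREEZE rewrites xmin, the lookup key differs and the
# existing lock is missed -- a serialization anomaly goes undetected.
assert not table.has_conflict(1, 5, FROZEN_XID)
```

The two bugs in the mail are the same failure shape with a different key component: passing the wrong offset (bug 1) or the wrong ctid (bug 2) also produces a key that does not match the stored lock.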
Re: [HACKERS] pgbench filler columns
On Thu, Sep 26, 2013 at 2:05 PM, Pavan Deolasee <pavan.deola...@gmail.com> wrote:
> While looking at the compressibility of WAL files generated by pgbench,
> which is close to 90%, I first thought it's because of the filler
> column in the accounts table. [...]
>
> The comment about them being NULL and hence not taking up any space was
> added by commit b7a67c2840f193f in response to this bug report:
> http://www.postgresql.org/message-id/200710170810.l9h8a76i080...@wwwmaster.postgresql.org

On a more careful look, it seems the original bug report complained about all tables except accounts. And all other tables indeed have filler as NULL. But the way the comment is written, it seems as if it applies to all the DDLs. Should we just fix the comment and say it's applicable to all tables except accounts?

Thanks, Pavan
-- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 2013-09-26 12:13:30 +0900, Michael Paquier wrote:
>> 2) I don't think the drop algorithm used now is correct. Your
>> index_concurrent_set_dead() sets both indisvalid = false and
>> indislive = false at the same time. It does so after doing a
>> WaitForVirtualLocks() - but that's not sufficient. Between waiting and
>> setting indisvalid = false another transaction could start which then
>> would start using that index. Which will not get updated anymore by
>> other concurrent backends because of indislive = false. You really
>> need to follow index_drop's lead here and first unset indisvalid,
>> then wait till nobody can use the index for querying anymore, and
>> only then unset indislive.
>
> Sorry, I do not follow you here. index_concurrent_set_dead calls
> index_set_state_flags, which sets indislive and *indisready* to false,
> not indisvalid. The concurrent index never uses indisvalid = true, so
> it can never be used by another backend for a read query. The drop
> algorithm is made to be consistent with DROP INDEX CONCURRENTLY, btw.

That makes it even worse... You can do the concurrent drop only in the following steps:

1) set indisvalid = false, so no future relcache lookups will have it as valid
2) now wait for all transactions that potentially still can use the index for *querying* to finish. During that window indisready *must* be true, otherwise the index will have outdated contents.
3) mark the index as indislive = false, indisready = false. Anything using a newer relcache entry will now not update the index.
4) wait till all potential updaters of the index have finished.
5) drop the index.

With the patch's current scheme, concurrent queries that use plans referencing the old index will get wrong results (at least in read committed) because concurrent writers will not update it anymore, since it's marked indisready = false. This isn't a problem of the *new* index, it's a problem of the *old* one. Am I missing something?

>> 3) I am not sure if the swap algorithm used now actually is correct
>> either. We have mvcc snapshots now, right, but we're still potentially
>> taking separate snapshots for individual relcache lookups. What's
>> stopping us from temporarily ending up with two relcache entries with
>> the same relfilenode? Previously you swapped names - I think that
>> might end up being easier, because having names temporarily confused
>> isn't as bad as two indexes manipulating the same file.
>
> Actually, performing the swap operation with names proves to be more
> difficult than it looks, as it makes necessary a moment where both the
> old and new indexes are marked as valid for all the backends.

But that doesn't make the current method correct, does it?

> The only reason for that is that index_set_state_flags assumes that a
> given xact has not yet done any transactional update when it is
> called, limiting to one the number of state flags that can be changed
> inside a transaction. This is a safe method IMO, and we shouldn't
> break that.

Part of that reasoning comes from the non-mvcc snapshot days, so it's not really up to date anymore. Even if you don't want to go through all that logic - which I'd understand quite well - you can just do it like:

1) start with: old index: valid, ready, live; new index: invalid, ready, live
2) in one transaction: switch the names: real_name => tmp_name, new_name => real_name
3) in one transaction: mark real_name (which is the rebuilt index) as valid, and new_name as invalid

Now, if we fail in the midst of 3, we'd have two indexes marked as valid. But that's unavoidable as long as you don't want to use transactions. Alternatively you could pass in a flag to use transactional updates; that should now be safe. At least, unless the old index still has indexcheckxmin = true with an xmin that's not old enough. But in that case we cannot do the concurrent reindex at all, I think, since we rely on the old index being fully valid.

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
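Andres' five-step drop sequence can be written down as a toy state machine (a sketch with invented names, not patch code) to make the invariant explicit: the index must stay maintained (indisready = true) for the whole window in which a backend with an old relcache entry might still read it:

```python
# Toy model of the concurrent index drop sequence (steps 1-5 above).
# The assertion encodes the invariant: while readers may still use the
# index (step 2's wait), writers must still be maintaining it.

class IndexState:
    def __init__(self):
        self.valid = True    # planner may pick it for reads
        self.ready = True    # writers must maintain it
        self.live = True     # index exists for new relcache entries

def concurrent_drop(ix, wait_for_readers, wait_for_writers):
    # 1) stop future readers from seeing the index as valid
    ix.valid = False
    # 2) wait out transactions that may still query it; the index
    #    must keep receiving updates through this window
    assert ix.ready, "index unmaintained while still readable"
    wait_for_readers()
    # 3) stop writers via a newer relcache entry
    ix.live = False
    ix.ready = False
    # 4) wait out remaining potential updaters
    wait_for_writers()
    # 5) only now is it safe to physically drop
    return "dropped"

ix = IndexState()
assert concurrent_drop(ix, lambda: None, lambda: None) == "dropped"
assert not ix.valid and not ix.ready and not ix.live
```

The bug Andres describes corresponds to performing step 3 before step 2's wait has completed: readers with old plans still scan the index while writers have already stopped updating it.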
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> [...]
> That makes it even worse... You can do the concurrent drop only in the
> following steps:
>
> 1) set indisvalid = false, so no future relcache lookups will have it
> as valid

indisvalid is never set to true for the concurrent index. The swap is done with the concurrent index having indisvalid = false and the former index having indisvalid = true. The concurrent index is validated with index_validate in a transaction before the swap transaction.
-- Michael
Re: [HACKERS] pgbench - exclude pthread_create() from connection start timing
>> pgbench changes, when adding the throttling stuff. Having the start
>> time taken when the thread really starts is just sanity, and I needed
>> that just to rule out that it was the source of the strange measures.
>
> I don't get it; why is taking the time just after pthread_create() more
> sane than taking it just before pthread_create()?

Thread creation time seems to be expensive as well, maybe up to 0.1 seconds under some conditions (?). Under --rate, this creation delay means that throttling is lagging behind schedule by about that time, so all the first transactions are trying to catch up.

> [...] typically far more expensive than pthread_create(). The patch
> for threaded pgbench made the decision to account for pthread_create()
> as though it were part of establishing the connection. You're
> proposing to not account for it at all. Both of those designs are
> reasonable to me, but I do not comprehend the benefit you anticipate
> from switching from one to the other.
>
>> -j 800 vs -j 100 : ITM that if you create more threads, the time
>> delay incurred is cumulative, so the strangeness of the result should
>> worsen.
>
> Not in general; we do one INSTR_TIME_SET_CURRENT() per thread, just
> before calling pthread_create(). However, thread 0 is a special case;
> we set its start time first and actually start it last. Your
> observation of cumulative delay fits those facts.

Yep, that must be thread 0 which has a very large delay.

>> I think it is simpler that each thread records its start time when it
>> has started, without exception.
>
> Initializing the thread-0 start time later, just before calling its
> threadRun(), should clear this anomaly without changing other aspects
> of the measurement.

Always taking the thread start time when the thread is started does solve the issue as well, and it is homogeneous for all cases, so the solution I suggest seems reasonable and simple.
> While pondering this area of the code, it occurs to me -- shouldn't we
> initialize the throttle rate trigger later, after establishing
> connections and sending startup queries? As it stands, we build up a
> schedule deficit during those tasks. Was that deliberate?

In principle, I agree with you. The connection creation time is another thing, but it depends on the options set. Under some options the connection is opened and closed for every transaction, so there is no point in avoiding it in the measure or in the scheduling, and I want to avoid having to distinguish those cases. Moreover, ISTM that one of the threads reuses the existing connection while the others recreate it. So I left it as is.

-- Fabien.
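The fix Fabien argues for -- each thread records its own start time when it actually begins running, so pthread_create() latency never enters the throttle schedule -- can be sketched as follows. This is Python threads standing in for pgbench's pthreads, with invented names; it is an illustration of the measurement idea, not pgbench code:

```python
# Sketch: measure per-thread start time *inside* the thread, so the
# time spent in thread creation is excluded from the schedule and can
# be reported separately as a creation delay.

import threading
import time

def run_thread(results, idx, spawn_time):
    start = time.monotonic()           # taken when the thread really runs
    results[idx] = start - spawn_time  # creation delay, kept out of schedule

def main(nthreads=4):
    results = [None] * nthreads
    threads = []
    for i in range(nthreads):
        t = threading.Thread(target=run_thread,
                             args=(results, i, time.monotonic()))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    # every thread measured its own, non-negative creation delay
    return all(d is not None and d >= 0 for d in results)

assert main()
```

The design choice under discussion is exactly where `start` is sampled: before the create call (creation counts against the schedule, as in the then-current code) or inside the thread (homogeneous for all threads, including thread 0, as suggested here).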
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 2013-09-26 20:40:40 +0900, Michael Paquier wrote:
> [...]
> indisvalid is never set to true for the concurrent index. The swap is
> done with the concurrent index having indisvalid = false and the
> former index having indisvalid = true. The concurrent index is
> validated with index_validate in a transaction before the swap
> transaction.

Yes. I've described how it *has* to be done, not how it's done. The current method of going straight to isready = false for the original index will result in wrong results, because it's not updated anymore while it's still being used.
Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Wed, Sep 25, 2013 at 08:48:11PM -0700, Peter Geoghegan wrote:
> On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian <br...@momjian.us> wrote:
>> This thread had a lot of discussion about bloating. I wonder, does
>> the code check to see if there is a matching row _before_ adding any
>> data?
>
> That's pretty much what the patch does.

So, I guess my question is: if we are only bloating on a contended operation, do we expect that to happen so much that bloat is a problem? I think the big objection to the patch is the additional code complexity and the potential to slow down other sessions. If it is only bloating on a contended operation, are these two downsides worth avoiding the bloat?

--
Bruce Momjian  <br...@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On Thu, Sep 26, 2013 at 8:43 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> [...]
> Yes. I've described how it *has* to be done, not how it's done. The
> current method of going straight to isready = false for the original
> index will result in wrong results, because it's not updated anymore
> while it's still being used.

The index being dropped at the end of the process is not the former index, but the concurrent index. The index used after REINDEX CONCURRENTLY is the old index, but with the new relfilenode. Am I lacking caffeine? It looks so...
-- Michael
Re: [HACKERS] Patch for fail-back without fresh backup
On Thu, Sep 19, 2013 at 4:02 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
> Hmm... when synchronous_transfer is set to data_flush, IMO the
> intuitive behaviors are
>
> (1) synchronous_commit = on
>     A data flush should wait for the corresponding WAL to be flushed
>     in the standby
> (2) synchronous_commit = remote_write
>     A data flush should wait for the corresponding WAL to be written
>     to OS in the standby.
> (3) synchronous_commit = local
> (4) synchronous_commit = off
>     A data flush should wait for the corresponding WAL to be written
>     locally in the master.

I thought synchronous_commit and synchronous_transfer are kind of orthogonal to each other. synchronous_commit only controls whether and how to wait for the standby when a transaction commits. synchronous_transfer OTOH tells how to interpret the standbys listed in the synchronous_standbys parameter. If set to "commit", then they are synchronous standbys (like today). If set to "data_flush", they are asynchronous failback-safe standbys, and if set to "all", then they are synchronous failback-safe standbys. Well, it's confusing :-(

So IMHO in the current state of things, the synchronous_transfer GUC cannot be changed at a session/transaction level, since all backends, including background workers, must honor the setting to guarantee failback safety. synchronous_commit still works the same way, but is ignored if synchronous_transfer is set to data_flush, because that effectively tells us that the standbys listed under synchronous_standbys are really *async* standbys with failback safety.

Thanks, Pavan
-- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 2013-09-26 20:47:33 +0900, Michael Paquier wrote:
> [...]
> The index being dropped at the end of the process is not the former
> index, but the concurrent index. The index used after REINDEX
> CONCURRENTLY is the old index, but with the new relfilenode.

That's not relevant, unless I am missing something. After phase 4 both indexes are valid (although only the old one is flagged as such), but due to the switching of the relfilenodes, backends could have either of the two open, depending on the time they built the relcache entry. Right?

Then you go ahead and mark the old index - which still might be used! - as dead in phase 5. Which means other backends might (again, depending on the time they have built the relcache entry) not update it anymore. In read committed we very well might go ahead and use the index with the same plan as before, but with a new snapshot. Which now will miss entries. Am I misunderstanding the algorithm you're using?

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] setting separate values of replication parameters to each standby to provide more granularity
Hi,

How about providing more granularity to replication by setting separate values of replication parameters for each standby, for example:

standby1.wal_sender_timeout = 50s
standby2.wal_sender_timeout = 40s

The idea is to allow configuration of standby servers such that they have their own set of replication parameters as per requirements. If they are not specified, then the standby is configured with default settings.

I have already raised this point when I made the proposal about introducing the ini file, but there was no feedback on this point (setting separate values for each standby); here is the link for that:
http://www.postgresql.org/message-id/caf8q-gxvg38m4vvghuntw4nvu69byfj5n--qmounc-bwv8r...@mail.gmail.com

As per the discussion, use of an ini file is not necessary. Currently PostgreSQL provides infrastructure which allows level-2 nesting (if we need further nesting of parameters we can use Andres' patch: http://www.postgresql.org/message-id/20130225211533.gd3...@awork2.anarazel.de ), so we can do the setting as follows:

--postgresql.conf--
st1.wal_sender_timeout = 40s
st2.wal_sender_timeout = 50s
---

postgres=# show st1.wal_sender_timeout;
 st1.wal_sender_timeout
------------------------
 40s
(1 row)

postgres=# show st2.wal_sender_timeout;
 st2.wal_sender_timeout
------------------------
 50s
(1 row)

But this just handles the parser stage; we still need to write the underlying code which actually configures each standby server with the corresponding parameter values.

I think that use of postgresql.conf is a much better option for this. One can easily make an argument about the size and complexity of postgresql.conf if it allows such settings; in that case we can use the *include* directive and separate out the replication parameters into a sub-file. But after all of this, I think it will be a good change and will provide more granularity to replication.

Greetings,
Samrat Revagade,
NTT DATA OSS Center, Pune, India.
Re: [HACKERS] Patch for fast gin cache performance improvement
Hi Ian, This patch contains a performance improvement for the fast gin cache. As you may know, the performance of the fast gin cache decreases with its size. Currently, the size of the fast gin cache is tied to work_mem. The size of work_mem can often be quite high. The large size of work_mem is inappropriate for the fast gin cache size. Therefore, we created a separate cache size called gin_fast_limit. This global variable controls the size of the fast gin cache, independently of work_mem. Currently, the default gin_fast_limit is set to 128kB. However, that value could need tweaking. 64kB may work better, but it's hard to say with only my single machine to test on. On my machine, this patch results in a nice speed up. Our test queries improve from about 0.9 ms to 0.030 ms. Please feel free to use the test case yourself: it should be attached. I can look into additional test cases (tsvectors) if anyone is interested. In addition to the global limit, we have provided a per-index limit: fast_cache_size. This per-index limit begins at -1, which means that it is disabled. If the user does not specify a per-index limit, the index will simply use the global limit. I had a look over this patch. I think this patch is interesting and very useful. Here are my review points: 1. Patch applies cleanly. 2. make, make install and make check is good. 3. I did performance evaluation using your test queries with 64kB and 128kB of gin_fast_limit (or fast_cache_size), and saw that both values achieved the performance gains over gin_fast_limit = '256MB'. 64kB worked better than 128kB. 64kB improved from 1.057 ms to 0.075 ms. Great! 4. In my understanding, the small value of gin_fast_limit/fast_cache_size leads to the increase in GIN search performance, which, however, leads to the decrease in GIN update performance. Am I right? If so, I think the tradeoff should be noted in the documentation. 5. The following documents in Chapter 57. GIN Indexes need to be updated: * 57.3.1. 
GIN Fast Update Technique * 57.4. GIN Tips and Tricks 6. I would like to see the results for the additional test cases (tsvectors). 7. The commented-out elog() code should be removed. 8. I think there are no issues in this patch. However, I have one question: how this patch works in the case where gin_fast_limit/fast_cache_size = 0? In this case, in my understanding, this patch inserts new entries into the pending list temporarily and immediately moves them to the main GIN data structure using ginInsertCleanup(). Am I right? If so, that is obviously inefficient. Sorry for the delay. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
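The fallback semantics described in the review — a per-index fast_cache_size of -1 meaning "disabled, use the global gin_fast_limit" — reduce to a one-line rule. A sketch with a hypothetical function name (not the patch's actual code):

```c
#include <stdint.h>

/* Hypothetical: resolve the effective fast-cache limit for an index.
 * A per-index limit of -1 means "not set"; fall back to the global
 * gin_fast_limit. Both values are in kilobytes. */
static int64_t
effective_fast_limit(int64_t index_limit_kb, int64_t global_limit_kb)
{
    return (index_limit_kb >= 0) ? index_limit_kb : global_limit_kb;
}
```

Note that under this rule a per-index value of 0 is a valid setting (an empty pending list), which is exactly the case question 8 above asks about.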
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Sep 26, 2013 at 07:43:15AM -0400, Bruce Momjian wrote: On Wed, Sep 25, 2013 at 08:48:11PM -0700, Peter Geoghegan wrote: On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian br...@momjian.us wrote: This thread had a lot of discussion about bloating. I wonder, does the code check to see if there is a matching row _before_ adding any data? That's pretty much what the patch does. So, I guess my question is if we are only bloating on a contended operation, do we expect that to happen so much that bloat is a problem? I think the big objection to the patch is the additional code complexity and the potential to slow down other sessions. If it is only bloating on a contended operation, are these two downsides worth avoiding the bloat? Also, this isn't like the case where we are incrementing sequences --- I am unclear what workload is going to cause a lot of contention. If two sessions try to insert the same key, there will be bloat, but later upsert operations will already see the insert and not cause any bloat. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory
On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote: On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund and...@2ndquadrant.com wrote: Actually, I was wondering if we ought to have a directory under pgdata whose explicit charter it was to contain files that shouldn't be copied as part of a base backup. pg_do_not_back_this_up. Wondered exactly about that as soon as you've mentioned pg_basebackup. pg_local/? That seems reasonable. It's not totally transparent what that's supposed to mean, but it's fairly mnemonic once you know. Other suggestions welcome, if anyone has ideas. I like the concept and have no improvements on the name. Are there any other likely candidates for inclusion in that directory other than this stuff? pg_xlog Not sure whether it's sensible to only LOG in these cases. After all, there's something unexpected happening. The robustness argument doesn't count since we're already shutting down. I see no point in throwing an error. The fact that we're having trouble cleaning up one dynamic shared memory segment doesn't mean we shouldn't try to clean up others, or that any remaining postmaster shutdown hooks shouldn't be executed. Well, it means we'll do a regular shutdown instead of PANICing and *not* writing a checkpoint. If something has corrupted our state to the point that we cannot unregister shared memory we registered, something has gone terribly wrong. Quite possibly we've scribbled over our control structures or such. In that case it's not proper to do a normal shutdown; we're quite possibly writing bad data. I have to admit I didn't consider the possibility of an otherwise-clean shutdown that hit only this problem. I'm not sure how seriously to take that case. I guess we could emit warnings for individual failures and then throw an error at the end if there were > 0, but that seems a little ugly. Or we could go whole hog and treat any failure as a critical error. Anyone else have an opinion on what to do here?
There's extensive precedent in our code for LOG, WARNING, or even ignoring the return value of unlink(). (To my surprise, ignoring the return value is the most popular choice.) Of the dozens of backend callers, here is the mixed bag that actually raises ERROR or better: do_pg_stop_backup() RestoreArchivedFile() KeepFileRestoredFromArchive() create_tablespace_directories() [remove old symlink during recovery] destroy_tablespace_directories() RelationCacheInitFilePreInvalidate() CreateLockFile() I think it's awfully unlikely that runaway code would corrupt shared_buffers AND manage to make an unlink() fail. Imo that's a PANIC or at the very least a FATAL. Sure, that's a tempting option, but it doesn't seem to serve any very necessary point. There's no data corruption problem if we proceed here. Most likely either (1) there's a bug in the code, which panicking won't fix or (2) the DBA hand-edited the state file, in which case maybe he shouldn't have done that, but if he thinks the best way to recover from that is a cluster-wide restart, he can do that himself. There's no data corruption problem if we proceed - but there likely has been one leading to the current state. +1 for making this one a PANIC, though. With startup behind us, a valid dsm state file pointed us to a control segment with bogus contents. The conditional probability of shared memory corruption seems higher than that of a DBA editing the dsm state file of a running cluster to incorrectly name as the dsm control segment some other existing shared memory segment. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench filler columns
On Thu, Sep 26, 2013 at 03:23:30PM +0530, Pavan Deolasee wrote: On Thu, Sep 26, 2013 at 2:05 PM, Pavan Deolasee pavan.deola...@gmail.com wrote: While looking at the compressibility of WAL files generated by pgbench, which is close to 90%, I first thought it was because of the filler column in the accounts table. But a comment in pgbench.c says:

/*
 * Note: TPC-B requires at least 100 bytes per row, and the filler
 * fields in these table declarations were intended to comply with that.
 * But because they default to NULLs, they don't actually take any space.
 * We could fix that by giving them non-null default values. However, that
 * would completely break comparability of pgbench results with prior
 * versions. Since pgbench has never pretended to be fully TPC-B
 * compliant anyway, we stick with the historical behavior.
 */

The comment about them being NULL and hence not taking up any space was added by commit b7a67c2840f193f in response to this bug report: http://www.postgresql.org/message-id/200710170810.l9h8a76i080...@wwwmaster.postgresql.org On a more careful look, it seems the original bug report complained about all tables except accounts. And all other tables indeed have filler as NULL. But the way the comment is written, it seems as if it applies to all the DDLs. Agreed. Should we just fix the comment and say it's applicable to all tables except accounts? Please do. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] record identical operator
On Tue, Sep 24, 2013 at 3:22 PM, Stephen Frost sfr...@snowman.net wrote: * Robert Haas (robertmh...@gmail.com) wrote: Now I admit that's an arguable point. We could certainly define an intermediate notion of equality that is more equal than whatever = does, but not as equal as exact binary equality. I suggested it up-thread and don't recall seeing a response, so here it is again: passing the data through the binary-out function for the type and comparing *that* would allow us to change the internal binary representation of data types, and would be something which we could at least explain to users as to why X isn't the same as Y according to this binary operator. Sorry, I missed that suggestion. I'm wary of inventing a completely new way of doing this. I don't think that there's any guarantee that the send/recv functions won't expose exactly the same implementation details as a direct check for binary equality. For example, array_send() seems to directly reveal the presence or absence of a NULL bitmap. Even if there were no such anomalies today, it feels fragile to rely on a fairly-unrelated concept to have exactly the semantics we want here, and it will surely be much slower. Binary equality has existing precedent and is used in numerous places in the code for good reason. Users might be confused about the use of those semantics in those places also, but AFAICT nobody is. I think the best thing I can say about this proposal is that it would clearly be better than what we're doing now, which is just plain wrong. I don't think it's the best proposal, however. It is obviously true, and unavoidable, that if the query can produce more than one result set depending on the query plan or other factors, then the materialized view contents can match only one of those possible result sets. But you are arguing that it should be allowed to produce some OTHER result set, that a direct execution of the query could *never* have produced, and I can't see how that can be right.
I agree that the matview shouldn't produce things which *can't* exist through an execution of the query. I don't intend to argue against that, but I realize that's a fallout of not accepting the patch to implement the binary comparison operators. My complaint is more generally that if this approach needs such operators then there are going to be problems down the road. No, I can't predict exactly what they are, and perhaps I'm all wet here, but this kind of binary-equality operation is something I've tried to avoid since discovering what happens when you try to compare a couple of floating point numbers. So, I get that. But what I think is that the problem that's coming up here is almost the flip side of that. If you are working with types that are a little bit imprecise, such as floats or citext or box, you want to use comparison strategies that have a little bit of tolerance for error. In the case of box, this is actually built into the comparison operator. In the case of floats, it's not; you as the application programmer have to deal with the fact that comparisons are imprecise - like by avoiding equality comparisons. On the other hand, if you are *replicating* those data types, then you don't want that tolerance. If you're checking whether two boxes are equal, you may indeed want the small amount of fuzziness that our comparison operators allow. But if you're copying a box or a float from one table to another, or from one database to another, you want the values copied exactly, including all of those low-order bits that tend to foul up your comparisons. That's why float8out() normally doesn't display any extra_float_digits - because you as the user shouldn't be relying on them - but pg_dump does back them up, because not doing so would allow errors to propagate. Similarly here.
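The operator-equality-versus-binary-image distinction described above is visible even with float8 zeroes: the `=` operator calls 0.0 and -0.0 equal, while their bit patterns differ. A standalone illustration in plain C (not the actual Datum-level comparison code):

```c
#include <stdbool.h>
#include <string.h>

/* "Operator" equality: roughly what float8's '=' gives you. */
static bool
float8_operator_equal(double a, double b)
{
    return a == b;
}

/* Binary-image equality: a memcmp over the raw representation,
 * the kind of check the matview-refresh code wants for replication. */
static bool
float8_binary_equal(double a, double b)
{
    return memcmp(&a, &b, sizeof(double)) == 0;
}
```

Here `float8_operator_equal(0.0, -0.0)` is true while `float8_binary_equal(0.0, -0.0)` is false - exactly the case where copying "an equal value" instead of the identical value would silently flip the sign bit in the copy.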
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] bitmap indexes
On Wed, Sep 25, 2013 at 3:20 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Hi, Here are some quick items while skimming this patch. I am looking at commit 6448de29d from your github repo, branch bmi. What's with the pg_bitmapindex stuff in pg_namespace.h? It doesn't seem to be used anywhere. This led me to research how these indexes are stored. I note that what we're doing here is to create another regular table and a btree index on top of it, and those are the ones that actually store the index data. This seems grotty and completely unlike the way we do things elsewhere (compare GIN indexes which have rbtrees inside them). +1 on that. I don't know about grottiness, but it certainly seems like it would deadlock like crazy. Which another product's bitmap indexes are infamous for, but since we don't need to store visibility information in our indexes, hopefully we can do better. Cheers, Jeff
Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix
On Wed, Sep 25, 2013 at 4:46 AM, David Rowley dgrowle...@gmail.com wrote: Ok, I think I've managed to narrow the performance gap to just about nothing but noise, though to do this the code is now a bit bigger. I've added a series of tests to see if the padding is 0, and if it's not then I'm doing things the old way. I've also added some code which does a fast test to see if it is worthwhile calling the padding processing function. This is just a simple if (*p <= '9'); I'm not completely happy with that as it does look a bit weird, but to compensate I've added a good comment to explain what it is doing. Please find attached the new patch ... version v0.5 and also updated benchmark results. Are you sure this is the right set of benchmark results? This still reflects a 15-18% slowdown AFAICS. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Here's an updated version of this patch, with fixes to all the bugs reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and Amit Kapila for the reports. I'm not very happy with the use of a separate relation fork for storing this data. Using an existing fork number rather than creating a new one avoids some of them (like, the fact that we loop over all known fork numbers in various places, and adding another one will add latency in all of those places, particularly when there is a system call in the loop) but not all of them (like, what happens if the index is unlogged? we have provisions to reset the main fork but any others are just removed; is that OK?), and it also creates some new ones (like, files having misleading names). More generally, I fear we really opened a bag of worms with this relation fork stuff. Every time I turn around I run into a problem that could be solved by adding another relation fork. I'm not terribly sure that it was a good idea to go that way to begin with, because we've got customers who are unhappy about 3 files/heap due to inode consumption and slow directory lookups. I think we would have been smarter to devise a strategy for storing the fsm and vm pages within the main fork in some fashion, and I tend to think that's the right solution here as well. Of course, it may be hopeless to put the worms back in the can at this point, and surely these indexes will be lightly used compared to heaps, so it's not incrementally exacerbating the problems all that much. But I still feel uneasy about widening use of that mechanism. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Extra functionality to createuser
Sitting on my todo list for a while has been to consider the idea of adding a bit of additional functionality to createuser. One of the functions of CREATE ROLE is to associate the role with other roles, thus...

create role my_new_user nosuperuser nocreatedb login IN ROLE app_readonly_role, app2_writer_role;

That isn't something that I can do using createuser; to do that, I would need to submit two requests separately:

PGUSER=postgres createuser -D -S -l my_new_user
PGUSER=postgres psql -c "grant app_readonly_role, app2_writer_role to my_new_user;"

I could certainly change over to using psql to do all the work, but it would be rather nice if createuser had (say) a -g option which allowed specifying the set of roles that should be assigned. Thus, the above commands might be replaced by:

PGUSER=postgres createuser -D -S -l -g app_readonly_role,app2_writer_role my_new_user

Would this be worth adding to the ToDo list? -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?" -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] backup.sgml-cmd-v003.patch
On Sep 3, 2013, at 6:56 AM, Karl O. Pinc k...@meme.com wrote: On 07/31/2013 12:08:12 PM, Ivan Lezhnjov IV wrote: Patch filename: backup.sgml-cmd-v003.patch The third version of this patch takes into consideration feedback received after the original submission (it can be read starting from this message: http://www.postgresql.org/message-id/CA+Tgmoaq-9D_mst113TdW=ar8mpgbc+x6t61azk3emhww9g...@mail.gmail.com) Essentially, it addresses the points that were raised in community feedback and offers better worded statements that avoid implying that some features are being deprecated when it isn't the case. We also spent some more time polishing other details, like making adjustments to the tone of the text so that it sounds more like a manual, and less like a blog post. More importantly, this chapter now makes it clear that superuser privileges are not always required to perform a successful backup, because in practice, as long as the role used to make a backup has sufficient read privileges on all of the objects a user is interested in, it's going to work just fine. We also mention and show examples of usage for pg_restore and pigz alongside gzip, and probably something else too. Hi Ivan, I'm reviewing your patch. I did not read the entirety of the thread referenced above. I apologize if this causes problems. Attached is backup-sgml-cmnd-v003_1.patch to be applied on top of backup-sgml-cmnd-v003.patch and containing my edits. You will eventually want to produce a single patch (but see below). Meanwhile this might help you keep track of my changes. Attached also is your original v3 patch. --- Cleaned up and clarified here and there. The bit about OIDs being deprecated might properly belong in a separate patch. The same might be said about adding mention of pigz. If you submit these as separate patch file attachments they can always be applied in a single commit, but the reverse is more work for the committer.
(Regardless, I see no reason to have separate commitfest entries or anything other than multiple attachments to the email that finalizes our discussion.) Hello, took me a while to get here, but a lot has been going on... Okay, I'm new, and I don't know why a single patch like this is more work for a committer? Just so I understand and know. Minimally modified to note the existence of directory dumps. It may be that the utility/flexibility of directory dumps should also be mentioned. My thought is that the part beginning with "The options in detail are:" should not describe all the possibilities for the --format option, that being better left to the reference section. Likewise, this being prose, it might be best to describe all the options in-line, instead of presented as a list. I have left it as-is for you to improve as seen fit. Agreed, it probably looks better as a sentence. I have frobbed your programlisting to adjust the indentation and line-wrap style. I submit it here for consideration in case this style is attractive. This is nothing but conceit. We should use the same style used elsewhere in the documentation. (I can't think offhand of a place to look for a line-wrapped shell example. If you can't find one I'll take a look, and if neither of us finds one we'll then have choices.) Looks good to me. I don't know that it's necessary to include pigz examples, because it sounds like pigz is a drop-in gzip replacement. I've left your examples in, in case you feel they are necessary. We do. We believe it can encourage more people to consider using it. The way we see it, most people seem to be running multicore systems these days, yet many simply are not aware of pigz. We have been routinely informing our customers of pigz as an alternative to the tried and tested gzip when helping optimize their configurations, and all of them without exception went for it. I guess everybody just likes to squeeze some extra juice for free.
The existing text of the SQL Dump section could use some alteration to reduce redundancy and add clarity. I'm thinking specifically of mention of pg_restore as being required to restore custom format backups and of the default pg_dump output being not just plain text but being a collection of SQL commands. Yes, the latter is obvious upon reflection, psql being what it is, but I think it would be helpful to spell this out. Especially in the context of the current patch. There could well be other areas like this to be addressed. I don't quite follow you here. I mean, I kinda understand what you mean in general, but when I look at the text I fail to see what you had in mind specifically. For example, pg_restore is mentioned only 3 times in section 24.1. Each mention seems pretty essential to me. And the text flow is pretty natural. Also, about plain text format being a collection of SQL commands. The very first paragraph of the section 24.1 reads The idea
Re: [HACKERS] Minmax indexes
Robert Haas escribió: On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Here's an updated version of this patch, with fixes to all the bugs reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and Amit Kapila for the reports. I'm not very happy with the use of a separate relation fork for storing this data. I understand this opinion, as I considered it myself while developing it. Also, GIN already does things this way. Perhaps I should just bite the bullet and do this. Using an existing fork number rather than creating a new one avoids some of them (like, the fact that we loop over all known fork numbers in various places, and adding another one will add latency in all of those places, particularly when there is a system call in the loop) but not all of them (like, what happens if the index is unlogged? we have provisions to reset the main fork but any others are just removed; is that OK?), and it also creates some new ones (like, files having misleading names). All good points. Index scans will normally access the revmap in sequential fashion; it would be enough to chain revmap pages, keeping a single block number in the metapage pointing to the first one, and subsequent ones are accessed from a next block number in each page. However, heap insertion might need to access a random revmap page, and this would be too slow. I think it would be enough to keep an array of block numbers in the index's metapage; the metapage would be share locked on every scan and insert, but that's not a big deal because exclusive lock would only be needed on the metapage to extend the revmap, which would be a very infrequent operation. As this will require some rework to this code, I think it's fair to mark this as returned with feedback for the time being. I will return with an updated version soon, fixing the relation fork issue as well as the locking and visibility bugs reported by Erik. 
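The metapage scheme sketched above - an array of revmap block numbers, so heap insertion can reach a random revmap page without walking a chain - boils down to a division and a modulo. A minimal illustration with hypothetical names and sizes (not the actual patch code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Hypothetical: how many heap-range entries one revmap page holds. */
#define REVMAP_ITEMS_PER_PAGE 8192

/* Hypothetical metapage layout: the array is grown only under an
 * exclusive lock; lookups need only a share lock on the metapage. */
typedef struct MinmaxMetaPage
{
    uint32_t    num_revmap_pages;
    BlockNumber revmap_blocks[64];
} MinmaxMetaPage;

/* Locate the revmap page and slot covering a given heap block in O(1),
 * instead of following per-page next-block pointers. */
static BlockNumber
revmap_locate(const MinmaxMetaPage *meta, BlockNumber heap_blk,
              uint32_t *slot)
{
    uint32_t    idx = heap_blk / REVMAP_ITEMS_PER_PAGE;

    assert(idx < meta->num_revmap_pages);
    *slot = heap_blk % REVMAP_ITEMS_PER_PAGE;
    return meta->revmap_blocks[idx];
}
```

This shows why sequential scans and random heap insertions both stay cheap: a scan reads the array entries in order, while an insert jumps straight to `revmap_blocks[heap_blk / REVMAP_ITEMS_PER_PAGE]`.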
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Erik Rijkers wrote: I have the impression it's not quite working correctly. The attached program returns different results for different values of enable_bitmapscan (consistently). Clearly there's some bug somewhere. I'll investigate it more. ( Btw, I had to make the max_locks_per_transaction higher for even not-so-large tables -- is that expected? For a 100M row table, max_locks_per_transaction=1024 was not enough; I set it to 2048. Might be worth some documentation, eventually. ) Not documentation -- that would also be a bug which needs to be fixed. Thanks for testing. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory
On 9/26/13 8:27 AM, Noah Misch wrote: On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote: On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund and...@2ndquadrant.com wrote: Actually, I was wondering if we ought to have a directory under pgdata whose explicit charter it was to contain files that shouldn't be copied as part of a base backup. pg_do_not_back_this_up. Wondered exactly about that as soon as you've mentioned pg_basebackup. pg_local/? That seems reasonable. It's not totally transparent what that's supposed to mean, but it's fairly mnemonic once you know. Other suggestions welcome, if anyone has ideas. I like the concept and have no improvements on the name. Are there any other likely candidates for inclusion in that directory other than this stuff? pg_xlog Isn't it also pointless to back up temp objects as well as non-logged tables? Or is the purpose of pg_local to be a home for things that MUST NOT be backed up, as opposed to items where backing them up is pointless? -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6
On 09/25/2013 01:20 PM, Steve Singer wrote: On 09/25/2013 11:08 AM, Andres Freund wrote: On 2013-09-25 11:01:44 -0400, Steve Singer wrote: On 09/17/2013 10:31 AM, Andres Freund wrote: This patch set now fails to apply because of the commit Rename various freeze multixact variables. And I am even partially guilty for that patch... Rebased patches attached. While testing the logical replication changes against my WIP logical slony I am sometimes getting error messages from the WAL sender of the form: unexpected duplicate for tablespace X relfilenode X Any chance you could provide a setup to reproduce the error? The steps to build a setup that should reproduce this error are:

1. I had to apply the attached patch on top of your logical replication branch so my pg_decode_init would know if it was being called as part of an INIT_REPLICATION or START_REPLICATION. Unless I have misunderstood something, you will probably want to merge this fix in.
2. Get my WIP for adding logical support to slony from: g...@github.com:ssinger/slony1-engine.git branch logical_repl (4af1917f8418a) (My code changes to slony are more prototype-level code quality than production code quality.)
3. cd slony1-engine; ./configure --with-pgconfigdir=/usr/local/pg94wal/bin (or whatever); make; make install
4. Grab the clustertest framework JAR from https://github.com/clustertest/clustertest-framework and build up a clustertest jar file.
5. Create a file slony1-engine/clustertest/conf/java.conf that contains the path to the above JAR file as a shell variable assignment, i.e.: CLUSTERTESTJAR=/home/ssinger/src/clustertest/clustertest_git/build/jar/clustertest-coordinator.jar
6. cp clustertest/conf/disorder.properties.sample clustertest/conf/disorder.properties and edit disorder.properties to have the proper values for your environment. All 6 databases can point at the same postgres instance; this test will only actually use 2 of them (so far).
7.
Run the test: cd clustertest; ./run_all_disorder_tests.sh This involves having the slon connect to the walsender on the database test1 and replicate the data into test2 (which is a different database on the same postmaster). If this setup seems like too much effort I can request one of the commitfest VMs from Josh and get everything set up there for you. Steve Any ideas? I'll look into it. Could you provide any context to what you're doing that's being decoded? I've determined that in this test the walsender seems to be hitting this when it is decoding the transactions that are behind the slonik commands to add tables to replication (set add table, set add sequence). This is before the SUBSCRIBE SET is submitted. I've also noticed something else that is strange (but might be unrelated). If I stop my slon process and restart it I get messages like: WARNING: Starting logical replication from 0/a9321360 ERROR: cannot stream from 0/A9321360, minimum is 0/A9320B00 Where 0/A9321360 was sent in the last packet my slon received from the walsender before the restart. If I force it to restart replication from 0/A9320B00 I see data rows that I appear to have already seen before the restart. I think this is happening when my slon is killed after processing the data for 0/A9320B00 but before it sends the feedback message. Is this expected? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench progress report improvements - split 3 v2 - A
Patch (4): Redefine latency as reported by pgbench and report lag more. Here is a first partial patch, which focuses on measuring latency and reporting the measure under --progress.

** Improve pgbench measurements & progress report

Measure transaction latency instead under --rate and --progress. The previously computed figure does not make sense under throttling: as sleep (throttling) time was included in the figures, the displayed results were plain false. The latency and its standard deviation are printed out under --progress and in the final report when available. It could be made always available, but that would require accepting additional gettimeofday() calls. I do not think that there is a performance issue here, but there is significant opposition to the idea. Sample output under benchmarking with --progress=1:

progress: 1.0 s, 2626.1 tps, 0.374 stddev 0.597 ms lat
progress: 2.0 s, 2766.6 tps, 0.358 stddev 0.588 ms lat
progress: 3.0 s, 2567.4 tps, 0.385 stddev 0.665 ms lat
progress: 4.0 s, 3014.2 tps, 0.328 stddev 0.593 ms lat
progress: 5.0 s, 2959.3 tps, 0.334 stddev 0.553 ms lat
...
progress: 16.0 s, 5009.2 tps, 0.197 stddev 0.381 ms lat
...
progress: 24.0 s, 7051.2 tps, 0.139 stddev 0.284 ms lat
...
progress: 50.0 s, 6860.5 tps, 0.143 stddev 0.052 ms lat
...

The warmup is quite fast because the DB is on an SSD. In the beginning the standard deviation is well over the average transaction time, but when the steady state is reached (later) it is much below it.
-- Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 06dd709..a48498a 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -172,6 +172,7 @@
 int		agg_interval;		/* log aggregates instead of individual
 							 * transactions */
 int		progress = 0;		/* thread progress report every this seconds */
 int		progress_nclients = 0;	/* number of clients for progress report */
+int		progress_nthreads = 0;	/* number of threads for progress report */
 bool	is_connect;			/* establish connection for each transaction */
 bool	is_latencies;		/* report per-command latencies */
 int		main_pid;			/* main process id used in log filename */
@@ -214,6 +215,8 @@ typedef struct
 	int			nvariables;
 	instr_time	txn_begin;		/* used for measuring transaction latencies */
 	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	int64		txn_latencies;	/* cumulated latencies */
+	int64		txn_sqlats;		/* cumulated square latencies */
 	bool		is_throttled;	/* whether transaction throttling is done */
 	int			use_file;		/* index in sql_files for this client */
 	bool		prepared[MAX_FILES];
@@ -243,6 +246,8 @@ typedef struct
 {
 	instr_time	conn_time;
 	int			xacts;
+	int64		latencies;
+	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
 } TResult;
@@ -1003,6 +1008,21 @@ top:
 	}

 	/*
+	 * always record latency under progress or throttling
+	 */
+	if ((progress || throttle_delay) && commands[st->state + 1] == NULL)
+	{
+		instr_time	now, diff;
+		int64		latency;
+		INSTR_TIME_SET_CURRENT(now);
+		diff = now;
+		INSTR_TIME_SUBTRACT(diff, st->txn_begin);
+		latency = INSTR_TIME_GET_MICROSEC(diff);
+		st->txn_latencies += latency;
+		st->txn_sqlats += latency * latency;
+	}
+
+	/*
 	 * if transaction finished, record the time it took in the log
 	 */
 	if (logfile && commands[st->state + 1] == NULL)
@@ -1191,8 +1211,8 @@ top:
 		goto top;
 	}

-	/* Record transaction start time if logging is enabled */
-	if (logfile && st->state == 0)
+	/* Record transaction start time under logging, progress or throttling */
+	if ((logfile || progress || throttle_delay) && st->state == 0)
 		INSTR_TIME_SET_CURRENT(st->txn_begin);

 	/* Record statement start time if per-command latencies are requested */
@@ -2096,6 +2116,7 @@ static void
 printResults(int ttype, int normal_xacts, int nclients,
 			 TState *threads, int nthreads,
 			 instr_time total_time, instr_time conn_total_time,
+			 int64 total_latencies, int64 total_sqlats,
 			 int64 throttle_lag, int64 throttle_lag_max)
 {
 	double		time_include,
@@ -2135,6 +2156,22 @@
 			   normal_xacts);
 	}

+	if (throttle_delay || progress)
+	{
+		/* compute and show latency average and standard deviation */
+		double		latency = 0.001 * total_latencies / normal_xacts;
+		double		sqlat = (double) total_sqlats / normal_xacts;
+		printf("latency average: %.3f ms\n"
+			   "latency stddev: %.3f ms\n",
+			   latency, 0.001 * sqrt(sqlat - 1000000.0 * latency * latency));
+	}
+	else
+	{
+		/* only an average latency computed from the duration is available */
+		printf("latency average: %.3f ms\n",
+			   1000.0 * duration / normal_xacts);
+	}
+
 	if (throttle_delay)
 	{
 		/*
@@ -2143,7 +2180,7 @@ printResults(int ttype, int normal_xacts, int nclients,
 	 * transaction. The measured lag may be caused by thread/client load,
 	 * the database load, or the Poisson
Re: [HACKERS] Minmax indexes
On 9/26/13 12:00 PM, Robert Haas wrote: More generally, I fear we really opened a bag of worms with this relation fork stuff. Every time I turn around I run into a problem that could be solved by adding another relation fork. I'm not terribly sure that it was a good idea to go that way to begin with, because we've got customers who are unhappy about 3 files/heap due to inode consumption and slow directory lookups. I think we would have been smarter to devise a strategy for storing the fsm and vm pages within the main fork in some fashion, and I tend to think that's the right solution here as well. Of course, it may be hopeless to put the worms back in the can at this point, and surely these indexes will be lightly used compared to heaps, so it's not incrementally exacerbating the problems all that much. But I still feel uneasy about widening use of that mechanism. Why would we add additional code complexity when forks do the trick? That seems like a step backwards, not forward. If the only complaint about forks is directory traversal why wouldn't we go with the well established practice of using multiple directories instead of glomming everything into one place? -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
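To make the multiple-directories idea concrete, here is one trivial shape it could take (purely illustrative — the bucket count, hash, and path layout are invented, not anything PostgreSQL does): spread relation files over a fixed set of subdirectories keyed by a hash of the relfilenode, so no single directory accumulates millions of entries.

```c
#include <stdio.h>

/* Map a relfilenode to one of 256 bucket subdirectories using Knuth's
 * multiplicative hash (top 8 bits of the 32-bit product). */
static unsigned int
bucket_for(unsigned int relfilenode)
{
	return (relfilenode * 2654435761u) >> 24;
}

/* Hypothetical sharded path: base/<bucket>/<relfilenode> */
static void
relpath_sharded(char *buf, size_t buflen, unsigned int relfilenode)
{
	snprintf(buf, buflen, "base/%02x/%u", bucket_for(relfilenode), relfilenode);
}
```

Directory lookups then scale with entries-per-bucket rather than total relations, which is the property Jim is after.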
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Sep 26, 2013 at 4:43 AM, Bruce Momjian br...@momjian.us wrote: So, I guess my question is: if we are only bloating on a contended operation, do we expect that to happen so much that bloat is a problem? Maybe I could have done a better job of explaining the nature of my concerns around bloat. I am specifically concerned about bloat and the clean-up of bloat that occurs between (or during) value locking and eventual row locking, because of the necessarily opportunistic nature of the way we go from one to the other. Bloat, and the obligation to clean it up synchronously, make row lock conflicts more likely. Conflicts make bloat more likely, because a conflict implies that another iteration, complete with more bloat, is necessary. When you consider that the feature will frequently be used with the assumption that updating is a much more likely outcome, it becomes clear that we need to be careful about this sort of interplay. Having said all that, I would have no objection to some reasonable, bounded amount of bloat occurring elsewhere if that made sense. For example, I'd certainly be happy to consider the question of whether or not it's worth doing a kind of speculative heap insertion before acquiring value locks, because that doesn't need to happen again and again in the same, critical place, in the interim between value locking and row locking. The advantage of doing that particular thing would be to reduce the duration that value locks are held - the disadvantages would be the *usual* disadvantages of bloat. However, this is obviously a premature discussion to have now, because the eventual exact nature of the value locks is not known. I think the big objection to the patch is the additional code complexity and the potential to slow down other sessions. If it is only bloating on a contended operation, are these two downsides worth avoiding the bloat?
I believe that all other schemes proposed have some degree of bloat even in the uncontended case, because they optimistically assume that an insert will occur, when in general an update is perhaps just as likely, and will bloat just the same. So, as I've said before, the definition of uncontended is important here. There is no reason to assume that alternative proposals will affect concurrency any less than my proposal - the buffer locking thing certainly isn't essential to my design. You need to weigh things like WAL-logging multiple times, which other proposals have. You're right to say that all of this is complex, but I really think that quite apart from anything else, my design is simpler than the others. For example, the design that Robert sketched would introduce a fairly considerable modularity violation, per my recent remarks to him, and actually plastering over that would be a considerable undertaking. Now, you might counter: but those other designs haven't been worked out enough. That's true, but then my efforts to work them out further by pointing out problems with them haven't gone very far. I have sincerely tried to see a way to make them work. -- Peter Geoghegan
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Tue, Sep 24, 2013 at 10:15 PM, Peter Geoghegan p...@heroku.com wrote: Well, I think we can rule out value locks that are held for the duration of a transaction right away. That's just not going to fly. I think I agree with that. I don't think I remember hearing that proposed. If we're really lucky, maybe the value locking stuff can be generalized or re-used as part of a btree index insertion buffer feature. Well, that would be nifty. Also, I tend to think that we might want to define the operation as a REPLACE-type operation with respect to a certain set of key columns; and so we'll do the insert-or-update behavior with respect only to the index on those columns and let the chips fall where they may with respect to any others. In that case this all becomes much less urgent. Well, MySQL's REPLACE does zero or more DELETEs followed by an INSERT, not try an INSERT, then maybe mark the heap tuple if there's a unique index dup and then go UPDATE the conflicting tuple. I mention this only because the term REPLACE has a certain baggage, and I feel it's important to be careful about such things. I see. Well, we could try to mimic their semantics, I suppose. Those semantics seem like a POLA violation to me; who would have thought that a REPLACE could delete multiple tuples? But what do I know? The only way that's going to work is if you say use this unique index, which will look pretty gross in DML. That might actually be okay with me if we had somewhere to go from there in a future release, but I doubt that's the case. Another issue is that I'm not sure that this helps Andres much (or rather, clients of the logical changeset generation infrastructure that need to do conflict resolution), and that matters a lot to me here. Yeah, it's kind of awful. Suppose we define the operation as REPLACE rather than INSERT...ON DUPLICATE KEY LOCK FOR UPDATE. Then we could do something like this: 1. Try to insert a tuple. If no unique index conflicts occur, stop. 2. 
Note the identity of the conflicting tuple and mark the inserted heap tuple dead. 3. If the conflicting tuple's inserting transaction is still in progress, wait for the inserting transaction to end. Sure, this is basically what the code does today (apart from marking a just-inserted tuple dead). 4. If the conflicting tuple is dead (e.g. because the inserter aborted), start over. Start over from where? I presume you mean the index tuple insertion, as things are today. Or do you mean the very start? Yes, that's what I meant. 5. If the conflicting tuple's key columns no longer match the key columns of the REPLACE operation, start over. What definition of equality or inequality? Binary equality, same as we'd use to decide whether an update can be done HOT. 7. Update the tuple, even though it may be invisible to our snapshot (a deliberate MVCC violation!). I realize that you just wanted to sketch a design, but offhand I think that the basic problem with what you describe is that it isn't accepting of the inevitability of there being a disconnect between value and row locking. Also, this doesn't fit with any roadmap for getting a real upsert, Well, there are two separate issues here: what to do about MVCC, and how to do the locking. From an MVCC perspective, I can think of only two behaviors when the conflicting tuple is committed but invisible: roll back, or update it despite it being invisible. If you're saying you don't like either of those choices, I couldn't agree more, but I don't have a third idea. If you do, I'm all ears. In terms of how to do the locking, what I'm mostly saying is that we could try to implement this in a way that invents as few new concepts as possible. No promise tuples, no new SLRU, no new page-level bits, just index tuples and heap tuples and so on. Ideally, we don't even change the WAL format, although step 2 might require a new record type. 
To the extent that what I actually described was at variance with that goal, consider it a defect in my explanation rather than an intent to vary. I think there's value in considering such an implementation because each new thing that we have to introduce in order to get this feature is a possible reason for it to be rejected - for modularity reasons, or because it hurts performance elsewhere, or because it's more code we have to maintain, or whatever. Now, what I hear you saying is, gee, the performance of that might be terrible. I'm not sure that I believe that, but it's possible that you're right. Much seems to depend on what you think the frequency of conflicts will be, and perhaps I'm assuming it will be low while you're assuming a higher value. Regardless, if the performance of the sort of implementation I'm talking about would be terrible (under some agreed-upon definition of what terrible means in this context), then that's a good argument for not doing it that way. I'm just not convinced that's the case.
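Robert's numbered steps boil down to a retry loop: attempt the insert; on a key conflict, update the conflicting row instead (or start over if it vanished). Stripped of all the hard parts — value locking, MVCC, concurrency, which are of course the whole difficulty — the control flow looks like this single-threaded toy (the in-memory "table" and names are mine):

```c
#define NSLOTS 8

struct row
{
	int			key;
	int			val;
	int			live;		/* stands in for tuple visibility */
};

static struct row table_[NSLOTS];

/* Toy insert-or-update keyed on an int.  Returns 1 on insert, 0 on
 * update of a conflicting row, -1 if full.  The real design must loop
 * back to "start over" because a concurrent transaction can remove or
 * change the conflicting tuple between steps; single-threaded, one pass
 * through suffices. */
static int
upsert(int key, int val)
{
	for (int i = 0; i < NSLOTS; i++)
		if (table_[i].live && table_[i].key == key)
		{
			table_[i].val = val;	/* conflict found: update it */
			return 0;
		}

	for (int i = 0; i < NSLOTS; i++)
		if (!table_[i].live)
		{
			table_[i].key = key;	/* no conflict: plain insert */
			table_[i].val = val;
			table_[i].live = 1;
			return 1;
		}

	return -1;
}
```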
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Sep 26, 2013 at 3:07 PM, Peter Geoghegan p...@heroku.com wrote: When you consider that the feature will frequently be used with the assumption that updating is a much more likely outcome, it becomes clear that we need to be careful about this sort of interplay. I think one thing that's pretty clear at this point is that almost any version of this feature could be optimized for either the insert case or the update case. For example, my proposal could be modified to search for a conflicting tuple first, potentially wasting an index probe (or multiple index probes, if you want to search for potential conflicts in multiple indexes) if we're inserting, but winning heavily in the update case. As written, it's optimized for the insert case. In fact, I don't know how to know which of these things we should optimize for. I wrote part of the code for an EDB proprietary feature that can do insert-or-update loads about 6 months ago[1], and we optimized it for updates. That was not, however, a matter of principle; it just turned out to be easier to implement that way. In fact, I would have assumed that the insert-mostly case was more likely, but I think the real answer is that some environments will be insert-mostly and some will be update-mostly and some will be a mix. If we really want to squeeze out every last drop of possible performance, we might need two modes: one that assumes we'll mostly insert, and another that assumes we'll mostly update. That seems a frustrating amount of detail to have to expose to the user; an implementation that was efficient in both cases would be very desirable, but I do not have a good idea how to get there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] In case you're wondering, attempting to use that feature to upsert an invisible tuple will result in the load failing with a unique index violation.
Re: [HACKERS] Minmax indexes
On Thu, Sep 26, 2013 at 2:58 PM, Jim Nasby j...@nasby.net wrote: Why would we add additional code complexity when forks do the trick? That seems like a step backwards, not forward. Well, they sorta do the trick, but see e.g. commit ece01aae479227d9836294b287d872c5a6146a11. I doubt that's the only code that's poorly-optimized for multiple forks; IOW, every time someone adds a new fork, there's a system-wide cost to that, even if that fork is only used in a tiny percentage of the relations that exist in the system. It's tempting to think that we can use the fork mechanism every time we have multiple logical streams of blocks within a relation and don't want to figure out a way to multiplex them onto the same physical file. However, the reality is that the fork mechanism isn't up to the job. I certainly don't want to imply that we shouldn't have gone in that direction - both the fsm and the vm are huge steps forward, and we wouldn't have gotten them in 8.4 without that mechanism. But they haven't been entirely without their own pain, too, and that pain is going to grow the more we push in the direction of relying on forks. If the only complaint about forks is directory traversal why wouldn't we go with the well established practice of using multiple directories instead of glomming everything into one place? That's not the only complaint about forks - but I support what you're proposing there anyhow, because it will be helpful to users with lots of relations regardless of what we do or do not decide to do about forks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix
On Fri, Sep 27, 2013 at 4:44 AM, Robert Haas robertmh...@gmail.com wrote: On Wed, Sep 25, 2013 at 4:46 AM, David Rowley dgrowle...@gmail.com wrote: Ok, I think I've managed to narrow the performance gap to just about nothing but noise, though to do this the code is now a bit bigger. I've added a series of tests to see if the padding is 0, and if it's not then I'm doing things the old way. I've also added some code which does a fast test to see if it is worthwhile calling the padding processing function. This is just a simple if (*p <= '9'); I'm not completely happy with that as it does look a bit weird, but to compensate I've added a good comment to explain what it is doing. Please find attached the new patch ... version v0.5 and also updated benchmark results. Are you sure this is the right set of benchmark results? This still reflects a 15-18% slowdown AFAICS. I think I must have forgotten to save it before I emailed it... Here's the correct version. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company log_line_prefix_benchmark_stresslog_v0.5.xls Description: MS-Excel spreadsheet
Re: [HACKERS] dynamic shared memory
On Thu, Sep 26, 2013 at 2:45 PM, Jim Nasby j...@nasby.net wrote: On 9/26/13 8:27 AM, Noah Misch wrote: On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote: On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund and...@2ndquadrant.com wrote: Actually, I was wondering if we ought to have a directory under pgdata whose explicit charter it was to contain files that shouldn't be copied as part of a base backup. pg_do_not_back_this_up. Wondered exactly about that as soon as you've mentioned pg_basebackup. pg_local/? That seems reasonable. It's not totally transparent what that's supposed to mean, but it's fairly mnemonic once you know. Other suggestions welcome, if anyone has ideas. I like the concept and have no improvements on the name. Are there any other likely candidates for inclusion in that directory other than this stuff? pg_xlog Isn't it also pointless to back up temp objects as well as non-logged tables? Or is the purpose of pg_local to be a home for things that MUST NOT be backed up, as opposed to items where backing them up is pointless? I don't know. I found it surprising that Noah suggested including pg_xlog; that's certainly not node-local state in any meaningful sense, the way the dsm stuff is. But I guess if pg_basebackup excludes it, it arguably qualifies. However, it'd be nice to advertise that pg_local can be cleared when the server is shut down, and you certainly can NOT do that to pg_xlog. Whee! I think the way I'd summarize the state of this patch is that everyone who has looked at it more or less agrees that the big picture is right, but there are details, which I think everyone admits are pretty much corner cases, where different people have different ideas about which way to go and what risks and benefits might be thereby incurred. I'll not pretend that my opinions are categorically better than any others, but of course I like them because they are mine.
Figuring out just what to do about those last details seems challenging; no answer will completely please everyone. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Re: Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls
I gave this a quick look. It took me a while to figure out how to apply it -- turns out you used the ignore-whitespace option to diff, so the only way to apply it was with patch -p1 --ignore-whitespace. Please don't do that.

I ran pgindent over the patched code and there were a number of changes. I suggest you run that over your local copy before your next submission, to avoid the next official run mangling your stuff in unforeseen ways. For instance, the comment starting with "A slight edge case" would be mangled; I suggest enclosing that in /* ... */ to avoid the problem. (TBH that's the only interesting thing, but avoiding that kind of breakage is worth it IMHO.)

First thing I noticed was the funky bms_initialize thingy. There was some controversy upthread about the use of bitmapsets, and it seems you opted for not using them for the constant case as suggested by Jeff; but apparently the other comment by Robert about the custom bms initializer went largely ignored. I agree with him that this is grotty. However, the current API to get partition-local memory is too limited, as there's no way to switch to the partition's context; instead you only get the option to allocate a certain amount of memory and return that. I think the easiest way to get around this problem is to create a new windowapi.h function which returns the MemoryContext for the partition. Then you can just allocate the BMS in that context. But how do we ensure that the BMS is allocated in that context? You'd have to switch contexts each time you call bms_add_member. I don't have a good answer to this. I used this code in another project:

/*
 * grow the visited bitmapset to the index' current size, to avoid
 * repeated repalloc's
 */
{
    BlockNumber lastblock;

    lastblock = RelationGetNumberOfBlocks(rel);
    visited = bms_add_member(visited, lastblock);
    visited = bms_del_member(visited, lastblock);
}

This way, I know the bitmapset already has enough space for all the bits I need and there will be no further allocation.
But this is also grotty. Maybe we should have a new entry point in bitmapset.h, say bms_grow, that ensures you have enough space for that many bits. Or perhaps add a MemoryContext member to struct Bitmapset, so that all allocations occur therein. I'm not too sure I follow the parse_agg.c changes, but don't you need to free the clone in the branch that finds a duplicate window spec? Is this parent/child relationship thingy a preexisting concept, or are you just coming up with it? It seems a bit unfamiliar to me. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
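The add-then-delete trick above exists only to force the set to its final size up front. A dedicated grow entry point — the hypothetical bms_grow floated here — would make that intent explicit. A toy bitset sketch of the idea (not the real Bitmapset from src/backend/nodes/bitmapset.c):

```c
#include <stdlib.h>
#include <string.h>

#define BITS_PER_WORD (8 * (int) sizeof(unsigned long))

typedef struct
{
	int			nwords;
	unsigned long *words;
} Bitset;

/* Ensure capacity for bits 0..maxbit, reallocating at most once. */
static void
bitset_grow(Bitset *b, int maxbit)
{
	int			need = maxbit / BITS_PER_WORD + 1;

	if (need > b->nwords)
	{
		b->words = realloc(b->words, need * sizeof(unsigned long));
		memset(b->words + b->nwords, 0,
			   (need - b->nwords) * sizeof(unsigned long));
		b->nwords = need;
	}
}

static void
bitset_set(Bitset *b, int bit)
{
	bitset_grow(b, bit);		/* no-op once pre-grown */
	b->words[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

static int
bitset_test(const Bitset *b, int bit)
{
	if (bit / BITS_PER_WORD >= b->nwords)
		return 0;
	return (b->words[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1;
}
```

After one up-front bitset_grow, later bitset_set calls never reallocate — the property the bms_add_member/bms_del_member dance buys today.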
Re: [HACKERS] gaussian distribution pgbench
On 9/20/13 2:42 AM, KONDO Mitsumasa wrote: I created a gaussian distribution pgbench patch that can access records with gaussian frequency, and submitted it to this commitfest. This patch no longer applies.
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On Thu, Sep 26, 2013 at 8:56 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-09-26 20:47:33 +0900, Michael Paquier wrote: On Thu, Sep 26, 2013 at 8:43 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-09-26 20:40:40 +0900, Michael Paquier wrote: On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-09-26 12:13:30 +0900, Michael Paquier wrote: 2) I don't think the drop algorithm used now is correct. Your index_concurrent_set_dead() sets both indisvalid = false and indislive = false at the same time. It does so after doing a WaitForVirtualLocks() - but that's not sufficient. Between waiting and setting indisvalid = false another transaction could start which then would start using that index. Which will not get updated anymore by other concurrent backends because of indislive = false. You really need to follow index_drop's lead here and first unset indisvalid, then wait till nobody can use the index for querying anymore, and only then unset indislive. Sorry, I do not follow you here. index_concurrent_set_dead calls index_set_state_flags, which sets indislive and *indisready* to false, not indisvalid. The concurrent index never uses indisvalid = true, so it can never be used by another backend for a read query. The drop algorithm is made to be consistent with DROP INDEX CONCURRENTLY, btw. That makes it even worse... You can do the concurrent drop only in the following steps: 1) set indisvalid = false, so no future relcache lookups will have it as valid. indisvalid is never set to true for the concurrent index. The swap is done with the concurrent index having indisvalid = false and the former index with indisvalid = true. The concurrent index is validated with index_validate in a transaction before the swap transaction. Yes. I've described how it *has* to be done, not how it's done.
The current method of going straight to isready = false for the original index will result in wrong results because it's not updated anymore while it's still being used. The index being dropped at the end of process is not the former index, but the concurrent index. The index used after REINDEX CONCURRENTLY is the old index but with the new relfilenode. That's not relevant unless I miss something. After phase 4 both indexes are valid (although only the old one is flagged as such), but due to the switching of the relfilenodes backends could have either of both open, depending on the time they built the relcache entry. Right? Then you go ahead and mark the old index - which still might be used! - as dead in phase 5. Which means other backends might (again, depending on the time they have built the relcache entry) not update it anymore. In read committed we very well might go ahead and use the index with the same plan as before, but with a new snapshot. Which now will miss entries. In this case, doing a call to WaitForOldSnapshots after the swap phase is enough. It was included in past versions of the patch but removed in the last 2 versions. Btw, taking the problem from another viewpoint... This feature has now 3 patches, the 2 first patches doing only code refactoring. Could it be possible to have a look at those ones first? Straight-forward things should go first, simplifying the core feature evaluation. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 2013-09-27 05:41:26 +0900, Michael Paquier wrote: In this case, doing a call to WaitForOldSnapshots after the swap phase is enough. It was included in past versions of the patch but removed in the last 2 versions. I don't think it is. I really, really suggest following the protocol used by index_drop down to a T and documenting every *slight* deviation carefully. We've had more than one bug in index_drop's concurrent feature. Btw, taking the problem from another viewpoint... This feature has now 3 patches, the first 2 patches doing only code refactoring. Could it be possible to have a look at those ones first? Straightforward things should go first, simplifying the core feature evaluation. I haven't looked at them in detail, but they looked good on a quick pass. I'll make another pass, but that won't be before, say, Tuesday. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Sep 26, 2013 at 03:33:34PM -0400, Robert Haas wrote: On Thu, Sep 26, 2013 at 3:07 PM, Peter Geoghegan p...@heroku.com wrote: When you consider that the feature will frequently be used with the assumption that updating is a much more likely outcome, it becomes clear that we need to be careful about this sort of interplay. I think one thing that's pretty clear at this point is that almost any version of this feature could be optimized for either the insert case or the update case. For example, my proposal could be modified to search for a conflicting tuple first, potentially wasting an index probe (or multiple index probes, if you want to search for potential conflicts in multiple indexes) if we're inserting, but winning heavily in the update case. As written, it's optimized for the insert case. I assumed the code was going to do the index lookups first without a lock, and take the appropriate action, insert or update, with fallbacks for guessing wrong. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix
On Thu, Sep 26, 2013 at 3:55 PM, David Rowley dgrowle...@gmail.com wrote: I think I must have forgotten to save it before I emailed it... Here's the correct version. Ah ha. Looks better. I'm working on getting this committed now. Aside from some stylistic things, I've found one serious functional bug, which is that you need to test padding != 0, not padding > 0, when deciding which path to take. No need to send a new patch, I've already fixed it in my local copy... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
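The padding != 0 point matters because negative widths are meaningful to the printf family: a %*s conversion with a negative width left-justifies instead of right-justifying, so treating negative padding as "no padding" silently drops a feature. A minimal standalone illustration (not the patch's code):

```c
#include <stdio.h>

/* Format s into dst with the given field width;
 * a negative width means left-justified per C's %*s semantics. */
static void
pad_str(char *dst, size_t dstlen, int padding, const char *s)
{
	snprintf(dst, dstlen, "%*s", padding, s);
}
```

With padding 5, "ab" comes out right-aligned as "   ab"; with padding -5 it comes out left-aligned as "ab   " — two distinct, useful behaviors that a padding > 0 test would conflate.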
Re: [HACKERS] record identical operator - Review
On Thu, Sep 19, 2013 at 06:31:38PM -0400, Steve Singer wrote: On 09/12/2013 06:27 PM, Kevin Grittner wrote: Attached is a patch for a bit of infrastructure I believe to be necessary for correct behavior of REFRESH MATERIALIZED VIEW CONCURRENTLY as well as incremental maintenance of matviews. Here is an attempt at a review of the v1 patch. There has been a heated discussion on list about whether we even want this type of operator. I will try to summarize it here.

The arguments against it are:
* a record_image_identical call on two records that returns false can still return true under the equality operator, and the records might (or might not) appear to be the same to users
* once we expose this as an operator via SQL someone will find it and use it, and then complain when we change things such as the internal representation of a datatype, which might break their queries
* the differences between = and record_image_identical might not be understood by everyone who finds the operator, leading to confusion
* using a criterion other than equality (the = operator provided by the datatype) might cause problems if we later provide 'on change' triggers for materialized views

The arguments for this patch are:
* we want the materialized view to return the same value as would be returned if the query were executed directly. This not only means that the values should be the same according to a datatype's = operator but that they should appear the same 'to the eyeball'.
* supporting the materialized view refresh check via SQL makes a lot of sense, but doing that requires exposing something via SQL
* if we adequately document what we mean by record_image_identical and the operator we pick for this, then users shouldn't be surprised at what they get if they use this

This is a good summary. I think there are a few things that make this issue difficult to decide: 1.
We have to use an operator to give the RMVC (REFRESH MATERIALIZED VIEW CONCURRENTLY) SPI query the ability to optimize this query. If we could do this with an SQL or C function, there would be less concern about the confusion caused by adding this capability. Question: If we are comparing based on some primary key, why do we need this to optimize? Certainly the primary key index will be used, no?

2. Aggregates are already non-deterministic, so some feel that adding this feature doesn't improve much. The counter-argument is that without the binary row comparison ability, rows could be returned that could _never_ have been produced by the base data, which is more of a user surprise than non-deterministic rows.

3. Our type casting and operators are already complex, and adding another set of operators only compounds that.

4. There are legitimate cases where tool makers might want the ability to compare rows binarily, e.g. for replication, and adding these operators would help with that.

I think we need to see a patch from Kevin that shows all the row comparisons documented so we can see the impact of the user-visible part. One interesting approach would be to only allow the operator to be called from SPI queries.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] [PATCH] pg_upgrade: Split off pg_fatal() from pg_log()
On Thu, Sep 12, 2013 at 10:50:42PM -0400, Peter Eisentraut wrote: The experiences with elog() and ereport() have shown that having one function that can return or not depending on some log level parameter isn't a good idea when you want to communicate well with the compiler. In pg_upgrade, there is a similar case with the pg_log() function. Since that isn't a public API, I'm proposing to change it and introduce a separate function pg_fatal() for those cases where the calls don't return.

Fine with me.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Sep 25, 2013 at 5:22 PM, Peter Eisentraut pete...@gmx.net wrote: On 9/23/13 5:36 PM, Alexander Korotkov wrote: In the attached version of patch double finding of ItemPointer during insert is avoided. Overhead becomes lower as expected. Fails cpluspluscheck: ./src/include/access/gin_private.h: In function ‘char* ginDataPageLeafReadItemPointer(char*, ItemPointer)’: ./src/include/access/gin_private.h:797:2: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]

Thanks. Fixed in attached version of patch. -- With best regards, Alexander Korotkov. gin-packed-postinglists-7.patch.gz
Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix
On Thu, Sep 26, 2013 at 5:18 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Sep 26, 2013 at 3:55 PM, David Rowley dgrowle...@gmail.com wrote: I think I must have forgotten to save it before I emailed it... Here's the correct version. Ah ha. Looks better. I'm working on getting this committed now. Aside from some stylistic things, I've found one serious functional bug, which is that you need to test padding != 0, not padding > 0, when deciding which path to take. No need to send a new patch; I've already fixed it in my local copy...

Committed now. Let me know if I broke anything.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Wait free LW_SHARED acquisition
Hello,

We have had several customers running postgres on bigger machines report problems on busy systems. Most recently one where a fully cached workload completely stalled in s_lock()s due to the *shared* lwlock acquisition in BufferAlloc() around the buffer partition lock. Increasing the padding to a full cacheline helps make the partitioning of the partition space actually effective (before, it's essentially halved on a read-mostly workload), but that still leaves one with very hot spinlocks.

So the goal is to have LWLockAcquire(LW_SHARED) never block unless somebody else holds an exclusive lock. To produce enough appetite for reading the rest of the long mail, here's some numbers on a pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620:

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065

That's roughly 400%. So, nice little improvement. Unless - not entirely unlikely - I fucked up and it's fast because it's broken.

Anyway, here's the algorithm I chose to implement: The basic idea is to have a single 'uint32 lockcount' instead of the former 'char exclusive' and 'int shared' and to use an atomic increment to acquire the lock. That's fairly easy to do for rw-spinlocks, but a lot harder for something like LWLocks that want to wait in the OS.

Exclusive lock acquisition: Use an atomic compare-and-exchange on the lockcount variable, swapping in EXCLUSIVE_LOCK (1 << 31, i.e. 0x80000000) if and only if the current value of lockcount is 0. If the swap was not successful, we have to wait.

Shared lock acquisition: Use an atomic add (lock xadd) on the lockcount variable to add 1. If the value is bigger than EXCLUSIVE_LOCK we know that somebody actually has an exclusive lock, and we back out by atomically decrementing by 1 again. In that case we have to wait for the exclusive locker to release the lock.

Queueing & Wakeup: Whenever we don't get a shared/exclusive lock we use nearly the same queuing mechanism as we currently do.
While we probably could make it lockless as well, the queue currently is still protected by the good old spinlock.

Release: Use an atomic decrement to release the lock. If the new value is zero (we get that atomically), we know we have to release waiters.

And the real world: Now, as you probably have noticed, naively doing the above has two glaring race conditions:

1) too-quick-for-queueing: We try to lock using the atomic operations and notice that we have to wait. Unfortunately, until we have finished queuing, the former locker very well might have already finished its work.

2) spurious failed locks: Due to the logic of backing out of shared locks after we unconditionally added 1 to lockcount, we might have prevented another exclusive locker from getting the lock:

1) Session A: LWLockAcquire(LW_EXCLUSIVE) - success
2) Session B: LWLockAcquire(LW_SHARED) - lockcount += 1
3) Session B: LWLockAcquire(LW_SHARED) - oops, bigger than EXCLUSIVE_LOCK
4) Session A: LWLockRelease()
5) Session C: LWLockAcquire(LW_EXCLUSIVE) - check if lockcount == 0; no, wait
6) Session B: LWLockAcquire(LW_SHARED) - lockcount -= 1
7) Session B: LWLockAcquire(LW_SHARED) - wait

So now we can have both B) and C) waiting on a lock that nobody is holding anymore. Not good.

The solution: We use a two-phased attempt at locking:

Phase 1: Try to do it atomically; if we succeed, nice.
Phase 2: Add us to the waitqueue of the lock.
Phase 3: Try to grab the lock again; if we succeed, remove ourselves from the queue.
Phase 4: Sleep till wakeup, goto Phase 1.

This protects us against both problems from above:

1) Nobody can release too quickly before we're queued, since after Phase 2 we're already queued.
2) If somebody spuriously got blocked from acquiring the lock, they will get queued in Phase 2 and we can wake them up if necessary.

Now, there are lots of tiny details to handle additionally to those, but those seem better handled by looking at the code?
- The above algorithm only works for LWLockAcquire, not directly for LWLockAcquireConditional where we don't want to wait. In that case we just need to retry acquiring the lock until we're sure we didn't disturb anybody in doing so.
- We can get removed from the queue of waiters in Phase 3, before we remove ourselves. In that case we need to absorb the wakeup.
- Spurious locks can prevent us from recognizing a lock that's free during release. Solve it by checking for existing waiters whenever an exclusive lock is released.

I've done a couple of off-hand benchmarks and so far I can confirm that everything using lots of shared locks benefits greatly and everything else doesn't really change much. So far I've seen mostly some slight improvements in exclusive lock heavy workloads, but those were pretty small. It's also very important to mention that those speedups are only there on multi-socket machines. From what I've benchmarked so far in LW_SHARED heavy workloads, with 1 socket you get ~5-10%, 2 sockets 20-30% and finally and
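For concreteness, the atomic fast paths described above can be sketched with C11 atomics. This is a paraphrase of the scheme as described in the mail, not code from the patch; all names are invented, and the spinlock-protected wait queue and the four sleep/retry phases are omitted:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 31 of lockcount marks an exclusive holder; the low bits count
 * shared holders. */
#define EXCLUSIVE_LOCK ((uint32_t) 1 << 31)

typedef struct
{
    _Atomic uint32_t lockcount;
} lwlock_t;

/* Exclusive fast path: CAS 0 -> EXCLUSIVE_LOCK. On failure the caller
 * must queue and wait. */
static bool try_acquire_exclusive(lwlock_t *lock)
{
    uint32_t expected = 0;
    return atomic_compare_exchange_strong(&lock->lockcount,
                                          &expected, EXCLUSIVE_LOCK);
}

/* Shared fast path: unconditional increment; if an exclusive holder is
 * present, back the increment out and report failure so the caller
 * queues and waits. */
static bool try_acquire_shared(lwlock_t *lock)
{
    uint32_t old = atomic_fetch_add(&lock->lockcount, 1);
    if (old >= EXCLUSIVE_LOCK)
    {
        atomic_fetch_sub(&lock->lockcount, 1);  /* spurious; undo */
        return false;
    }
    return true;
}

/* Release: subtract what we held (1 for shared, EXCLUSIVE_LOCK for
 * exclusive). Returns true when the count dropped to zero, i.e. when
 * waiters, if any, must now be woken. */
static bool release(lwlock_t *lock, uint32_t held)
{
    return atomic_fetch_sub(&lock->lockcount, held) - held == 0;
}
```

The back-out in try_acquire_shared is exactly what creates the "spurious failed locks" race described above, which is why release checks for waiters rather than trusting the count alone.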
Re: [HACKERS] Wait free LW_SHARED acquisition
On Thu, Sep 26, 2013 at 3:55 PM, Andres Freund and...@2ndquadrant.com wrote: We have had several customers running postgres on bigger machines report problems on busy systems. Most recently one where a fully cached workload completely stalled in s_lock()s due to the *shared* lwlock acquisition in BufferAlloc() around the buffer partition lock.

That's unfortunate. I saw someone complain about what sounds like exactly the same issue on Twitter yesterday: https://twitter.com/Roguelazer/status/382706273927446528 I tried to engage with him, but was unsuccessful.

-- Peter Geoghegan
Re: [HACKERS] Wait free LW_SHARED acquisition
On 2013-09-26 16:56:30 -0700, Peter Geoghegan wrote: On Thu, Sep 26, 2013 at 3:55 PM, Andres Freund and...@2ndquadrant.com wrote: We have had several customers running postgres on bigger machines report problems on busy systems. Most recently one where a fully cached workload completely stalled in s_lock()s due to the *shared* lwlock acquisition in BufferAlloc() around the buffer partition lock. That's unfortunate. I saw someone complain about what sounds like exactly the same issue on Twitter yesterday:

Well, fortunately there's a solution, as presented here ;) There's another bottleneck in the heaps of PinBuffer() calls in such workloads that presents itself after fixing the lwlock contention, at least in my tests. I think I see a solution there, but let's fix this first though ;)

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] record identical operator - Review
On Thu, Sep 26, 2013 at 05:50:08PM -0400, Bruce Momjian wrote: I think we need to see a patch from Kevin that shows all the row comparisons documented so we can see the impact of the user-visible part. One interesting approach would be to only allow the operator to be called from SPI queries.

It would also be good to know about similar non-default entries in pg_opclass so we can understand the expected impact.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] [PATCH] bitmap indexes
At 2013-09-26 08:39:05 -0700, jeff.ja...@gmail.com wrote: I don't know about grottiness, but it certainly seems like it would deadlock like crazy.

Hi Jeff. I don't understand the deadlock scenario you're thinking of. Could you explain, please?

-- Abhijit
Re: [HACKERS] backup.sgml-cmd-v003.patch
On 09/26/2013 12:15:25 PM, Ivan Lezhnjov IV wrote: On Sep 3, 2013, at 6:56 AM, Karl O. Pinc k...@meme.com wrote: On 07/31/2013 12:08:12 PM, Ivan Lezhnjov IV wrote: Patch filename: backup.sgml-cmd-v003.patch

The third version of this patch takes into consideration feedback received after the original submission (it can be read starting from this message: http://www.postgresql.org/message-id/CA +Tgmoaq-9D_mst113TdW=ar8mpgbc+x6t61azk3emhww9g...@mail.gmail.com). Essentially, it addresses the points that were raised in community feedback and offers better-worded statements that avoid implying that some features are being deprecated when that isn't the case. We also spent some more time polishing other details, like making adjustments to the tone of the text so that it sounds more like a manual and less like a blog post. More importantly, this chapter now makes it clear that superuser privileges are not always required to perform a successful backup, because in practice, as long as the role used to make a backup has sufficient read privileges on all of the objects a user is interested in, it's going to work just fine. We also mention and show examples of usage for pg_restore and pigz alongside gzip, and probably something else too.

--- Cleaned up and clarified here and there. The bit about OIDs being deprecated might properly belong in a separate patch. The same might be said about adding mention of pigz. If you submit these as separate patch file attachments, they can always be applied in a single commit, but the reverse is more work for the committer. (Regardless, I see no reason to have separate commitfest entries or anything other than multiple attachments to the email that finalizes our discussion.)

Hello, took me a while to get here, but a lot has been going on...

No worries.

Okay, I'm new, and I don't know why a single patch like this is more work for a committer? Just so I understand and know.
Different committers have different preferences, but the general rule is that it's work to split a patch into pieces if you don't like the whole thing, while it's easy to apply a bunch of small patches and commit them all at once. Further, each commit should represent a single feature or conceptual change. Again, preferences vary, but I like to think a good rule is that one commit should be describable in a sentence, and not a run-on sentence either that says "I did this and I also did that and something else". So if there's a question in your mind about whether a committer will want your entire change, or if your patch changes unrelated things, then it does not hurt to submit it as separate patches. All the patches can be attached to a single email and be part of a single commitfest tracking entry, usually. No need to get crazy. These are just things to think about.

In your case I see 3 things happening: OID deprecation, custom format explanation, and pigz promotion.

My thought is that the part beginning with "The options in detail are:" should not describe all the possibilities for the --format option, that being better left to the reference section. Likewise, this being prose, it might be best to describe all the options in-line, instead of presenting them as a list. I have left it as-is for you to improve as seen fit.

Agreed, it probably looks better as a sentence.

Looks good. I have frobbed your programlisting to adjust the indentation and line-wrap style. I submit it here for consideration in case this style is attractive. This is nothing but conceit. We should use the same style used elsewhere in the documentation.

Looks good to me.

I fixed the missing \ I messed up on last time and slightly re-worded the previous sentence. I've grep-ed through the sgml looking for multi-line shell scripts and found only one (in sepgsql.sgml). I don't see a conflict with the formatting/line-break convention used in the patch, although it does differ slightly in indentation.
I'm leaving the shell script formatting in the patch as-is for the committer to judge. (I like the way it looks, but it is not a traditional style.)

I don't know that it's necessary to include pigz examples, because it sounds like pigz is a drop-in gzip replacement. I've left your examples in, in case you feel they are necessary.

We do. We believe it can encourage more people to consider using it. The way we see it, most people seem to be running multicore systems these days, yet many simply are not aware of pigz.

Ok. It's your patch.

The existing text of the SQL Dump section could use some alteration to reduce redundancy and add clarity. I'm thinking specifically of the mention of pg_restore as being required to restore custom-format backups, and of the default pg_dump output being not just plain text but a collection of SQL commands. Yes, the latter is obvious upon reflection,
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas robertmh...@gmail.com wrote: Well, I think we can rule out value locks that are held for the duration of a transaction right away. That's just not going to fly. I think I agree with that. I don't think I remember hearing that proposed.

I think I might have been unclear - I mean locks that are held for the duration of *another* transaction, not our own, as we wait for that other transaction to commit/abort. I think that earlier remarks from yourself and Andres implied that this would be necessary. Perhaps I'm mistaken. Your most recent design proposal doesn't do this, but I think that that's only because it restricts the user to a single unique index - it would otherwise be necessary to sit on the earlier value locks (index tuples belonging to an unfinished transaction) pending the completion of some other conflicting transaction, which has numerous disadvantages (as described in my "it suits my purposes to have the value locks be held for only an instant" mail to you [1]).

If we're really lucky, maybe the value locking stuff can be generalized or re-used as part of a btree index insertion buffer feature. Well, that would be nifty.

Yes, it would. I think, based on a conversation with Rob Wultsch, that it's another area where MySQL still does quite a bit better.

I see. Well, we could try to mimic their semantics, I suppose. Those semantics seem like a POLA violation to me; who would have thought that a REPLACE could delete multiple tuples? But what do I know?

I think that it's fairly widely acknowledged to not be very good. Every MySQL user uses INSERT...ON DUPLICATE KEY UPDATE instead.

The only way that's going to work is if you say "use this unique index", which will look pretty gross in DML. Yeah, it's kind of awful.

It is.

What definition of equality or inequality? Binary equality, same as we'd use to decide whether an update can be done HOT.
I guess that's acceptable in theory, because binary equality is necessarily a *stricter* condition than equality according to some operator that is an equivalence relation. But the fact remains that you're just ameliorating the problem by making it happen less often (both through this kind of trick and by restricting us to one unique index), not actually fixing it.

Well, there are two separate issues here: what to do about MVCC, and how to do the locking.

Totally agreed. Fortunately, unlike the different aspects of value and row locking, I think that these two questions can be reasonably considered independently.

From an MVCC perspective, I can think of only two behaviors when the conflicting tuple is committed but invisible: roll back, or update it despite it being invisible. If you're saying you don't like either of those choices, I couldn't agree more, but I don't have a third idea. If you do, I'm all ears.

I don't have another idea either. In fact, I'd go so far as to say that doing any third thing that's better than those two to any reasonable person is obviously impossible. But I'd add that we simply cannot roll back at read committed, so we're just going to have to hold our collective noses and do strange things with visibility.

FWIW, I'm tentatively looking at doing something like this:

*************** HeapTupleSatisfiesMVCC(HeapTuple htup, S
*** 958,963 ****
--- 959,975 ----
  	 * By here, the inserting transaction has committed - have to check
  	 * when...
  	 */
+ 
+ 	/*
+ 	 * Not necessarily visible to snapshot under conventional MVCC rules, but
+ 	 * still locked by our xact and not updated -- importantly, normal MVCC
+ 	 * semantics apply when we update the row, so only one version will be
+ 	 * visible at once
+ 	 */
+ 	if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+ 		TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
+ 		return true;
+ 
  	if (XidInMVCCSnapshot(HeapTupleHeaderGetXmin(tuple), snapshot))
  		return false;	/* treat as still in progress */

This is something that I haven't given remotely enough thought yet, so please take it with a big grain of salt.

In terms of how to do the locking, what I'm mostly saying is that we could try to implement this in a way that invents as few new concepts as possible. No promise tuples, no new SLRU, no new page-level bits, just index tuples and heap tuples and so on. Ideally, we don't even change the WAL format, although step 2 might require a new record type. To the extent that what I actually described was at variance with that goal, consider it a defect in my explanation rather than an intent to vary. I think there's value in considering such an implementation because each new thing that we have to introduce in order to get this feature is a possible reason for it to be rejected - for modularity reasons, or because it hurts performance elsewhere, or because it's more code we have to maintain, or whatever.

There is certainly value in considering that, and you're right to take that tack - it is
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Sep 26, 2013 at 12:33 PM, Robert Haas robertmh...@gmail.com wrote: I think one thing that's pretty clear at this point is that almost any version of this feature could be optimized for either the insert case or the update case. For example, my proposal could be modified to search for a conflicting tuple first, potentially wasting an index probe (or multiple index probes, if you want to search for potential conflicts in multiple indexes) if we're inserting, but winning heavily in the update case.

I don't think that's really the case. In what sense could my design really be said to prioritize either the INSERT or the UPDATE case? I'm pretty sure that it's still necessary to get all the value locks per unique index needed up until the first one with a conflict, even if you know that you're going to UPDATE for *some* reason, in order for things to be well defined (which is important, because there might be more than one conflict, and which one is locked matters - maybe we could add DDL to let unique indexes have a checking priority or something like that). The only appreciable downside of my design for updates that I can think of is that there has to be another index scan, to find the locked-for-update row to update. However, that's probably worth it, since it is at least relatively rare, and it allows the user the flexibility of using a more complex UPDATE predicate than "apply to conflicter", which is something that the MySQL syntax effectively limits users to.

-- Peter Geoghegan