date:20130926

On 2013-09-26 12:13:30 +0900, Michael Paquier wrote:
  2) I don't think the drop algorithm used now is correct. Your
  index_concurrent_set_dead() sets both indisvalid = false and indislive =
  false at the same time. It does so after doing a WaitForVirtualLocks() -
  but that's not sufficient. Between waiting and setting indisvalid =
  false another transaction could start which then would start using that
  index. Which will not get updated anymore by other concurrent backends
  because of inislive = false.
  You really need to follow index_drop's lead here and first unset
  indisvalid then wait till nobody can use the index for querying anymore
  and only then unset indislive.

 Sorry, I do not follow you here. index_concurrent_set_dead calls
 index_set_state_flags that sets indislive and *indisready* to false,
 not indisvalid. The concurrent index never uses indisvalid = true so
 it can never be called by another backend for a read query. The drop
 algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw.

That makes it even worse... You can do the concurrent drop only in the
following steps:
1) set indisvalid = false, no future relcache lookups will have it as valid
2) now wait for all transactions that potentially still can use the index for
   *querying* to finish. During that indisready *must* be true,
   otherwise the index will have outdated contents.
3) Mark the index as indislive = false, indisready = false. Anything
   using a newer relcache entry will now not update the index.
4) Wait till all potential updaters of the index have finished.
5) Drop the index.

With the patch's current scheme concurrent queries that use plans using
the old index will get wrong results (at least in read committed)
because concurrent writers will not update it anymore since it's marked
indisready = false.
This isn't a problem of the *new* index, it's a problem of the *old*
one.

Am I missing something?

  3) I am not sure if the swap algorithm used now actually is correct
  either. We have mvcc snapshots now, right, but we're still potentially
  taking separate snapshot for individual relcache lookups. What's
  stopping us from temporarily ending up with two relcache entries with
  the same relfilenode?
  Previously you swapped names - I think that might end up being easier,
  because having names temporarily confused isn't as bad as two indexes
  manipulating the same file.

 Actually, performing swap operation with names proves to be more
 difficult than it looks as it makes necessary a moment where both the
 old and new indexes are marked as valid for all the backends.

But that doesn't make the current method correct, does it?

 The only
 reason for that is that index_set_state_flag assumes that a given xact
 has not yet done any transactional update when it is called, forcing
 to one the number of state flag that can be changed inside a
 transaction. This is a safe method IMO, and we shouldn't break that.

Part of that reasoning comes from the non-mvcc snapshot days, so it's
not really up to date anymore.
Even if you don't want to go through all that logic - which I'd
understand quite well - you can just do it like:
1) start with: old index: valid, ready, live; new index: invalid, ready, live
2) one transaction: switch names from real_name = tmp_name, new_name =
  real_name
3) one transaction: mark real_name (which is the rebuilt index) as valid,
and new_name as invalid

Now, if we fail in the midst of 3, we'd have two indexes marked as
valid. But that's unavoidable as long as you don't want to use
transactions.
Alternatively you could pass in a flag to use transactional updates,
that should now be safe.

At least, unless the old index still has indexcheckxmin = true with an
xmin that's not old enough. But in that case we cannot do the concurrent
reindex at all I think since we rely on the old index to be full valid.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-09-26 Thread Michael Paquier

On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-09-26 12:13:30 +0900, Michael Paquier wrote:
  2) I don't think the drop algorithm used now is correct. Your
  index_concurrent_set_dead() sets both indisvalid = false and indislive =
  false at the same time. It does so after doing a WaitForVirtualLocks() -
  but that's not sufficient. Between waiting and setting indisvalid =
  false another transaction could start which then would start using that
  index. Which will not get updated anymore by other concurrent backends
  because of inislive = false.
  You really need to follow index_drop's lead here and first unset
  indisvalid then wait till nobody can use the index for querying anymore
  and only then unset indislive.

 Sorry, I do not follow you here. index_concurrent_set_dead calls
 index_set_state_flags that sets indislive and *indisready* to false,
 not indisvalid. The concurrent index never uses indisvalid = true so
 it can never be called by another backend for a read query. The drop
 algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw.

 That makes it even worse... You can do the concurrent drop only in the
 following steps:
 1) set indisvalid = false, no future relcache lookups will have it as valid
indisvalid is never set to true for the concurrent index. Swap is done
with concurrent index having indisvalid = false and former index with
indisvalid = true. The concurrent index is validated with
index_validate in a transaction before swap transaction.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pgbench - exclude pthread_create() from connection start timing

2013-09-26 Thread Fabien COELHO




pgbench changes, when adding the throttling stuff. Having the start time
taken when the thread really starts is just sanity, and I needed that
just to rule out that it was the source of the strange measures.


I don't get it; why is taking the time just after pthread_create() more sane
than taking it just before pthread_create()?


Thread create time seems to be expensive as well, maybe up 0.1 seconds 
under some conditions (?). Under --rate, this create delay means that 
throttling is laging behind schedule by about that time, so all the first 
transactions are trying to catch up.



typically far more expensive than pthread_create().  The patch for threaded
pgbench made the decision to account for pthread_create() as though it were
part of establishing the connection.  You're proposing to not account for it
all.  Both of those designs are reasonable to me, but I do not comprehend the
benefit you anticipate from switching from one to the other.


-j 800 vs -j 100 : ITM that if I you create more threads, the time delay
incurred is cumulative, so the strangeness of the result should worsen.


Not in general; we do one INSTR_TIME_SET_CURRENT() per thread, just before
calling pthread_create().  However, thread 0 is a special case; we set its
start time first and actually start it last.  Your observation of cumulative
delay fits those facts.


Yep, that must be thread 0 which has a very large delay. I think it is 
simpler that each threads record its start time when it has started, 
without exception.


 Initializing the thread-0 start time later, just before calling its 
threadRun(), should clear this anomaly without changing other aspects of 
the measurement.


Always taking the thread start time when the thread is started does solve 
the issue as well, and it is homogeneous for all cases, so the solution I 
suggest seems reasonable and simple.


While pondering this area of the code, it occurs to me -- shouldn't we 
initialize the throttle rate trigger later, after establishing 
connections and sending startup queries?  As it stands, we build up a 
schedule deficit during those tasks.  Was that deliberate?


On the principle, I agree with you.

The connection creation time is another thing, but it depends on the 
options set. Under some options the connection is open and closed for 
every transaction, so there is no point in avoiding it in the measure or 
in the scheduling, and I want to avoid having to distinguish those cases. 
Morover, ISTM that one of the thread reuse the existing connection while 
other recreate is. So I left it as is.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Support for REINDEX CONCURRENTLY

On 2013-09-26 20:40:40 +0900, Michael Paquier wrote:
 On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-09-26 12:13:30 +0900, Michael Paquier wrote:
   2) I don't think the drop algorithm used now is correct. Your
   index_concurrent_set_dead() sets both indisvalid = false and indislive =
   false at the same time. It does so after doing a WaitForVirtualLocks() -
   but that's not sufficient. Between waiting and setting indisvalid =
   false another transaction could start which then would start using that
   index. Which will not get updated anymore by other concurrent backends
   because of inislive = false.
   You really need to follow index_drop's lead here and first unset
   indisvalid then wait till nobody can use the index for querying anymore
   and only then unset indislive.
 
  Sorry, I do not follow you here. index_concurrent_set_dead calls
  index_set_state_flags that sets indislive and *indisready* to false,
  not indisvalid. The concurrent index never uses indisvalid = true so
  it can never be called by another backend for a read query. The drop
  algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw.
 
  That makes it even worse... You can do the concurrent drop only in the
  following steps:
  1) set indisvalid = false, no future relcache lookups will have it as valid

 indisvalid is never set to true for the concurrent index. Swap is done
 with concurrent index having indisvalid = false and former index with
 indisvalid = true. The concurrent index is validated with
 index_validate in a transaction before swap transaction.

Yes. I've described how it *has* to be done, not how it's done.

The current method of going straight to isready = false for the original
index will result in wrong results because it's not updated anymore
while it's still being used.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

On Wed, Sep 25, 2013 at 08:48:11PM -0700, Peter Geoghegan wrote:
 On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian br...@momjian.us wrote:
  This thread had a lot of discussion about bloating.  I wonder, does the
  code check to see if there is a matching row _before_ adding any data?
 
 That's pretty much what the patch does.

So, I guess my question is if we are only bloating on a contended
operation, do we expect that to happen so much that bloat is a problem?

I think the big objection to the patch is the additional code complexity
and the potential to slow down other sessions.  If it is only bloating
on a contended operation, are these two downsides worth avoiding the
bloat?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-09-26 Thread Michael Paquier

On Thu, Sep 26, 2013 at 8:43 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-09-26 20:40:40 +0900, Michael Paquier wrote:
 On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2013-09-26 12:13:30 +0900, Michael Paquier wrote:
   2) I don't think the drop algorithm used now is correct. Your
   index_concurrent_set_dead() sets both indisvalid = false and indislive =
   false at the same time. It does so after doing a WaitForVirtualLocks() -
   but that's not sufficient. Between waiting and setting indisvalid =
   false another transaction could start which then would start using that
   index. Which will not get updated anymore by other concurrent backends
   because of inislive = false.
   You really need to follow index_drop's lead here and first unset
   indisvalid then wait till nobody can use the index for querying anymore
   and only then unset indislive.
 
  Sorry, I do not follow you here. index_concurrent_set_dead calls
  index_set_state_flags that sets indislive and *indisready* to false,
  not indisvalid. The concurrent index never uses indisvalid = true so
  it can never be called by another backend for a read query. The drop
  algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw.
 
  That makes it even worse... You can do the concurrent drop only in the
  following steps:
  1) set indisvalid = false, no future relcache lookups will have it as valid

 indisvalid is never set to true for the concurrent index. Swap is done
 with concurrent index having indisvalid = false and former index with
 indisvalid = true. The concurrent index is validated with
 index_validate in a transaction before swap transaction.

 Yes. I've described how it *has* to be done, not how it's done.

 The current method of going straight to isready = false for the original
 index will result in wrong results because it's not updated anymore
 while it's still being used.
The index being dropped at the end of process is not the former index,
but the concurrent index. The index used after REINDEX CONCURRENTLY is
the old index but with the new relfilenode.

Am I lacking of caffeine? It looks so...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch for fail-back without fresh backup

2013-09-26 Thread Pavan Deolasee

On Thu, Sep 19, 2013 at 4:02 PM, Fujii Masao masao.fu...@gmail.com wrote:



 Hmm... when synchronous_transfer is set to data_flush,
 IMO the intuitive behaviors are

 (1) synchronous_commit = on
 A data flush should wait for the corresponding WAL to be
 flushed in the standby

 (2) synchronous_commit = remote_write
 A data flush should wait for the corresponding WAL to be
 written to OS in the standby.

 (3) synchronous_commit = local
 (4) synchronous_commit = off
 A data flush should wait for the corresponding WAL to be
 written locally in the master.


I thought synchronous_commit and synchronous_transfer are kind of
orthogonal to each other. synchronous_commit only controls whether and how
to wait for the standby only when a transaction commits.
synchronous_transfer OTOH tells how to interpret the standby listed in
synchronous_standbys parameter. If set to commit then they are
synchronous standbys (like today). If set to data_flush, they are
asynchronous failback safe standby and if set to all then they are
synchronous failback safe standbys. Well, its confusing :-(

So IMHO in the current state of things, the synchronous_transfer GUC can
not be changed at a session/transaction level since all backends, including
background workers must honor the settings to guarantee failback safety.
synchronous_commit still works the same way, but is ignored if
synchronous_transfer is set to data_flush because that effectively tells
us that the standbys listed under synchronous_standbys are really *async*
standbys with failback safety.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: [HACKERS] Support for REINDEX CONCURRENTLY

On 2013-09-26 20:47:33 +0900, Michael Paquier wrote:
 On Thu, Sep 26, 2013 at 8:43 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-09-26 20:40:40 +0900, Michael Paquier wrote:
  On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
   On 2013-09-26 12:13:30 +0900, Michael Paquier wrote:
2) I don't think the drop algorithm used now is correct. Your
index_concurrent_set_dead() sets both indisvalid = false and 
indislive =
false at the same time. It does so after doing a 
WaitForVirtualLocks() -
but that's not sufficient. Between waiting and setting indisvalid =
false another transaction could start which then would start using 
that
index. Which will not get updated anymore by other concurrent backends
because of inislive = false.
You really need to follow index_drop's lead here and first unset
indisvalid then wait till nobody can use the index for querying 
anymore
and only then unset indislive.
  
   Sorry, I do not follow you here. index_concurrent_set_dead calls
   index_set_state_flags that sets indislive and *indisready* to false,
   not indisvalid. The concurrent index never uses indisvalid = true so
   it can never be called by another backend for a read query. The drop
   algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw.
  
   That makes it even worse... You can do the concurrent drop only in the
   following steps:
   1) set indisvalid = false, no future relcache lookups will have it as 
   valid
 
  indisvalid is never set to true for the concurrent index. Swap is done
  with concurrent index having indisvalid = false and former index with
  indisvalid = true. The concurrent index is validated with
  index_validate in a transaction before swap transaction.
 
  Yes. I've described how it *has* to be done, not how it's done.
 
  The current method of going straight to isready = false for the original
  index will result in wrong results because it's not updated anymore
  while it's still being used.

 The index being dropped at the end of process is not the former index,
 but the concurrent index. The index used after REINDEX CONCURRENTLY is
 the old index but with the new relfilenode.

That's not relevant unless I miss something.

After phase 4 both indexes are valid (although only the old one is
flagged as such), but due to the switching of the relfilenodes backends
could have either of both open, depending on the time they built the
relcache entry. Right?
Then you go ahead and mark the old index - which still might be used! -
as dead in phase 5. Which means other backends might (again, depending
on the time they have built the relcache entry) not update it
anymore. In read committed we very well might go ahead and use the index
with the same plan as before, but with a new snapshot. Which now will
miss entries.

Am I misunderstanding the algorithm you're using?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] setting separate values of replication parameters to each standby to provide more granularity

2013-09-26 Thread Samrat Revagade

Hi,

How about providing more granularity to replication, by setting separate
values of replication parameters to each standby
for example:
standby1.wal_sender_timeout= 50s
standby2.wal_sender_timeout= 40s

The idea is to allow configuration of standby servers such that  they have
there own set of replication parameters as per requirements.
If they are not specified then configure standby with default settings.

I have already raised this point when I made proposal about introducing the
ini file, but there was no feedback on this point (setting separate values
to each standby) here is the link for that:
http://www.postgresql.org/message-id/caf8q-gxvg38m4vvghuntw4nvu69byfj5n--qmounc-bwv8r...@mail.gmail.com

As per discussion use of ini file is not necessary. Currently PostgreSQL
provides infrastructure which allows level-2 nesting
( If we need further nesting of parameters we can use Andres patch
http://www.postgresql.org/message-id/20130225211533.gd3...@awork2.anarazel.de
)
we can do setting as follows:
--postgresql.conf--
st1.wal_sender_timeout = 40s
st2.wal_sender_timeout = 50s
---
postgres=# show st1.wal_sender_timeout ;
 st1.wal_sender_timeout

 40s
(1 row)

postgres=# show st2.wal_sender_timeout ;
 st2.wal_sender_timeout

 50s
(1 row)

But this just handles the parser stage, we need to write the underlying
code which actually configures standby server with corresponding parameter
values.

I think that use of postgresql.conf is much better option to do this.
one can easily do argument about the size and complexity of postgresql.conf
if it allows such setting.
In that case we can use *include* directive and separate out replication
parameters to another sub file.
But after all of this, I think it will be a good change and will provide
more granularity to the replication.

Greetings,
Samrat Revagade,
NTT DATA OSS Center Pune, India.

Re: [HACKERS] Patch for fast gin cache performance improvement

2013-09-26 Thread Etsuro Fujita

Hi Ian,

 This patch contains a performance improvement for the fast gin cache. As you
 may know, the performance of the fast gin cache decreases with its size.
 Currently, the size of the fast gin cache is tied to work_mem. The size of
 work_mem can often be quite high. The large size of work_mem is inappropriate
 for the fast gin cache size. Therefore, we created a separate cache size
called
 gin_fast_limit. This global variable controls the size of the fast gin cache,
 independently of work_mem. Currently, the default gin_fast_limit is set to
128kB.
 However, that value could need tweaking. 64kB may work better, but it's hard
 to say with only my single machine to test on.

 On my machine, this patch results in a nice speed up. Our test queries improve
 from about 0.9 ms to 0.030 ms. Please feel free to use the test case yourself:
 it should be attached. I can look into additional test cases (tsvectors) if
 anyone is interested.

 In addition to the global limit, we have provided a per-index limit:
 fast_cache_size. This per-index limit begins at -1, which means that it is
 disabled. If the user does not specify a per-index limit, the index will
simply
 use the global limit.

I had a look over this patch.  I think this patch is interesting and very
useful.  Here are my review points:

1. Patch applies cleanly.
2. make, make install and make check is good.
3. I did performance evaluation using your test queries with 64kB and 128kB of
gin_fast_limit (or fast_cache_size), and saw that both values achieved the
performance gains over gin_fast_limit = '256MB'.  64kB worked better than 128kB.
64kB improved from 1.057 ms to 0.075 ms.  Great!
4. In my understanding, the small value of gin_fast_limit/fast_cache_size leads
to the increase in GIN search performance, which, however, leads to the decrease
in GIN update performance.  Am I right?  If so, I think the tradeoff should be
noted in the documentation.
5. The following documents in Chapter 57. GIN Indexes need to be updated:
 * 57.3.1. GIN Fast Update Technique
 * 57.4. GIN Tips and Tricks
6. I would like to see the results for the additional test cases (tsvectors).
7. The commented-out elog() code should be removed.
8. I think there are no issues in this patch.  However, I have one question: how
this patch works in the case where gin_fast_limit/fast_cache_size = 0?  In this
case, in my understanding, this patch inserts new entries into the pending list
temporarily and immediately moves them to the main GIN data structure using
ginInsertCleanup().  Am I right?  If so, that is obviously inefficient.

Sorry for the delay.

Best regards,
Etsuro Fujita




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

On Thu, Sep 26, 2013 at 07:43:15AM -0400, Bruce Momjian wrote:
 On Wed, Sep 25, 2013 at 08:48:11PM -0700, Peter Geoghegan wrote:
  On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian br...@momjian.us wrote:
   This thread had a lot of discussion about bloating.  I wonder, does the
   code check to see if there is a matching row _before_ adding any data?
  
  That's pretty much what the patch does.
 
 So, I guess my question is if we are only bloating on a contended
 operation, do we expect that to happen so much that bloat is a problem?
 
 I think the big objection to the patch is the additional code complexity
 and the potential to slow down other sessions.  If it is only bloating
 on a contended operation, are these two downsides worth avoiding the
 bloat?

Also, this isn't like the case where we are incrementing sequences --- I
am unclear what workload is going to cause a lot of contention.  If two
sessions try to insert the same key, there will be bloat, but later
upsert operations will already see the insert and not cause any bloat.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] dynamic shared memory

2013-09-26 Thread Noah Misch

On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote:
 On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund and...@2ndquadrant.com wrote:
  Actually, I was wondering if we ought
  to have a directory under pgdata whose explicit charter it was to
  contain files that shouldn't be copied as part of a base backup.
  pg_do_not_back_this_up.
 
  Wondered exactly about that as soon as you've mentioned
  pg_basebackup. pg_local/?
 
 That seems reasonable.  It's not totally transparent what that's
 supposed to mean, but it's fairly mnemonic once you know.  Other
 suggestions welcome, if anyone has ideas.

I like the concept and have no improvements on the name.

 Are there any other likely candidates for inclusion in that directory
 other than this stuff?

pg_xlog

   Not sure whether it's sensible to only LOG in these cases. After all
   there's something unexpected happening. The robustness argument doesn't
   count since we're already shutting down.
 
  I see no point in throwing an error.   The fact that we're having
  trouble cleaning up one dynamic shared memory segment doesn't mean we
  shouldn't try to clean up others, or that any remaining postmaster
  shutdown hooks shouldn't be executed.
 
  Well, it means we'll do a regular shutdown instead of PANICing
  and *not* writing a checkpoint.
  If something has corrupted our state to the point we cannot unregister
  shared memory we registered, something has gone terribly wrong. Quite
  possibly we've scribbled over our control structures or such. In that
  case it's not proper to do a normal shutdown, we're quite possibly
  writing bad data.
 
 I have to admit I didn't consider the possibility of an
 otherwise-clean shutdown that hit only this problem.  I'm not sure how
 seriously to take that case.  I guess we could emit warnings for
 individual failures and then throw an error at the end if there were 
 0, but that seems a little ugly.  Or we could go whole hog and treat
 any failure as a critical error.  Anyone else have an opinion on what
 to do here?

There's extensive precedent in our code for LOG, WARNING, or even ignoring the
return value of unlink().  (To my surprise, ignoring the return value is the
most popular choice.)  Of the dozens of backend callers, here is the mixed bag
that actually raises ERROR or better:

do_pg_stop_backup()
RestoreArchivedFile()
KeepFileRestoredFromArchive()
create_tablespace_directories() [remove old symlink during recovery]
destroy_tablespace_directories()
RelationCacheInitFilePreInvalidate()
CreateLockFile()

I think it's awfully unlikely that runaway code would corrupt shared_buffers
AND manage to make an unlink() fail.

   Imo that's a PANIC or at the very least a FATAL.
 
  Sure, that's a tempting option, but it doesn't seem to serve any very
  necessary point.  There's no data corruption problem if we proceed
  here.  Most likely either (1) there's a bug in the code, which
  panicking won't fix or (2) the DBA hand-edited the state file, in
  which case maybe he shouldn't have done that, but if he thinks the
  best way to recover from that is a cluster-wide restart, he can do
  that himself.
 
  There's no data corruption problem if we proceed - but there likely
  has been one leading to the current state.

+1 for making this one a PANIC, though.  With startup behind us, a valid dsm
state file pointed us to a control segment with bogus contents.  The
conditional probability of shared memory corruption seems higher than that of
a DBA editing the dsm state file of a running cluster to incorrectly name as
the dsm control segment some other existing shared memory segment.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pgbench filler columns

2013-09-26 Thread Noah Misch

On Thu, Sep 26, 2013 at 03:23:30PM +0530, Pavan Deolasee wrote:
On Thu, Sep 26, 2013 at 2:05 PM, Pavan Deolasee
pavan.deola...@gmail.comwrote:

While looking at the compressibility of WAL files generated by pgbench,
which is close to 90%, I first thought its because of the filler column
in the accounts table. But a comment in pgbench.c says:

/*
* Note: TPC-B requires at least 100 bytes per row, and the filler
* fields in these table declarations were intended to comply with
that.
* But because they default to NULLs, they don't actually take any
space.
* We could fix that by giving them non-null default values. However,
that
* would completely break comparability of pgbench results with prior
* versions. Since pgbench has never pretended to be fully TPC-B
* compliant anyway, we stick with the historical behavior.
*/

The comment about them being NULL and hence not taking up any space is
added by commit b7a67c2840f193f in response to this bug report:

http://www.postgresql.org/message-id/200710170810.l9h8a76i080...@wwwmaster.postgresql.org

On a more careful look, it seems the original bug report complained about
all tables except accounts. And all other tables indeed have filler as
NULL. But the way comment is written it seems as if it applies to all DDLs.

Agreed.

Should we just fix the comment and say its applicable for all tables except
accounts ?

Please do.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] record identical operator

On Tue, Sep 24, 2013 at 3:22 PM, Stephen Frost sfr...@snowman.net wrote:
 * Robert Haas (robertmh...@gmail.com) wrote:
 Now I admit that's an arguable point.  We could certainly define an
 intermediate notion of equality that is more equal than whatever =
 does, but not as equal as exact binary equality.

 I suggested it up-thread and don't recall seeing a response, so here it
 is again- passing the data through the binary-out function for the type
 and comparing *that* would allow us to change the interal binary
 representation of data types and would be something which we could at
 least explain to users as to why X isn't the same as Y according to this
 binary operator.

Sorry, I missed that suggestion.

I'm wary of inventing a completely new way of doing this.  I don't
think that there's any guarantee that the send/recv functions won't
expose exactly the same implementation details as a direct check for
binary equality.  For example, array_send() seems to directly reveal
the presence or absence of a NULL bitmap.  Even if there were no such
anomalies today, it feels fragile to rely on a fairly-unrelated
concept to have exactly the semantics we want here, and it will surely
be much slower.  Binary equality has existing precedent and is used in
numerous places in the code for good reason.  Users might be confused
about the use of those semantics in those places also, but AFAICT
nobody is.

I think the best thing I can say about this proposal is that it would
clearly be better than what we're doing now, which is just plain
wrong.  I don't think it's the best proposal, however.

 It is obviously true, and unavoidable, that if the query can produce
 more than one result set depending on the query plan or other factors,
 then the materialized view contents can match only one of those
 possible result sets.  But you are arguing that it should be allowed
 to produce some OTHER result set, that a direct execution of the query
 could *never* have produced, and I can't see how that can be right.

 I agree that the matview shouldn't produce things which *can't* exist
 through an execution of the query.  I don't intend to argue against that
 but I realize that's a fallout of not accepting the patch to implement
 the binary comparison operators.  My complaint is more generally that if
 this approach needs such then there's going to be problems down the
 road.  No, I can't predict exactly what they are and perhaps I'm all wet
 here, but this kind of binary-equality operations are something I've
 tried to avoid since discovering what happens when you try to compare
 a couple of floating point numbers.

So, I get that.

But what I think is that the problem that's coming up here is almost
the flip side of that.  If you are working with types that are a
little bit imprecise, such as floats or citext or box, you want to use
comparison strategies that have a little bit of tolerance for error.
In the case of box, this is actually built into the comparison
operator.  In the case of floats, it's not; you as the application
programmer have to deal with the fact that comparisons are imprecise -
like by avoiding equality comparisons.

On the other hand, if you are *replicating* those data types, then you
don't want that tolerance.  If you're checking whether two boxes are
equal, you may indeed want the small amount of fuzziness that our
comparison operators allow.  But if you're copying a box or a float
from one table to another, or from one database to another, you want
the values copied exactly, including all of those low-order bits that
tend to foul up your comparisons.  That's why float8out() normally
doesn't display any extra_float_digits - because you as the user
shouldn't be relying on them - but pg_dump does back them up because
not doing so would allow errors to propagate.  Similarly here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH] bitmap indexes

2013-09-26 Thread Jeff Janes

On Wed, Sep 25, 2013 at 3:20 PM, Alvaro Herrera alvhe...@2ndquadrant.comwrote:

 Hi,

 Here are some quick items while skimming this patch.  I am looking at
 commit 6448de29d from your github repo, branch bmi.

 What's with the pg_bitmapindex stuff in pg_namespace.h?  It doesn't seem
 to be used anywhere.

 This led me to research how these indexes are stored.  I note that what
 we're doing here is to create another regular table and a btree index on
 top of it, and those are the ones that actually store the index data.
 This seems grotty and completely unlike the way we do things elsewhere
 (compare GIN indexes which have rbtrees inside them).



+1 on that.  I don't know about grottiness, but it certainly seems like it
would deadlock like crazy.  Which another product's bitmap indexes are
infamous for, but since we don't need to store visibility information in
our indexes, hopefully we can do better.

Cheers,

Jeff

Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix

On Wed, Sep 25, 2013 at 4:46 AM, David Rowley dgrowle...@gmail.com wrote:
 Ok, I think I've managed to narrow the performance gap to just about nothing
 but noise, though to do this the code is now a bit bigger. I've added a
 series of tests to see if the padding is  0 and if it's not then I'm doing
 things the old way.

 I've also added a some code which does a fast test to see if it is worth
 while calling the padding processing function. This is just a simple if (*p
 = '9'), I'm not completely happy with that as it does look a bit weird, but
 to compensate I've added a good comment to explain what it is doing.

 Please find attached the new patch ... version v0.5 and also updated
 benchmark results.

Are you sure this is the right set of benchmark results?  This still
reflects a 15-18% slowdown AFAICS.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Minmax indexes

On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Here's an updated version of this patch, with fixes to all the bugs
 reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
 Amit Kapila for the reports.

I'm not very happy with the use of a separate relation fork for
storing this data.  Using an existing fork number rather than creating
a new one avoids some of them (like, the fact that we loop over all
known fork numbers in various places, and adding another one will add
latency in all of those places, particularly when there is a system
call in the loop) but not all of them (like, what happens if the index
is unlogged?  we have provisions to reset the main fork but any others
are just removed; is that OK?), and it also creates some new ones
(like, files having misleading names).

More generally, I fear we really opened a bag of worms with this
relation fork stuff.  Every time I turn around I run into a problem
that could be solved by adding another relation fork.  I'm not
terribly sure that it was a good idea to go that way to begin with,
because we've got customers who are unhappy about 3 files/heap due to
inode consumption and slow directory lookups.  I think we would have
been smarter to devise a strategy for storing the fsm and vm pages
within the main fork in some fashion, and I tend to think that's the
right solution here as well.  Of course, it may be hopeless to put the
worms back in the can at this point, and surely these indexes will be
lightly used compared to heaps, so it's not incrementally exacerbating
the problems all that much.  But I still feel uneasy about widening
use of that mechanism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Extra functionality to createuser

2013-09-26 Thread Christopher Browne

Sitting on my todo list for a while has been to consider the idea of
adding a bit of additional functionality to createuser.

One of the functions of CREATE ROLE is to associate the role with
other roles, thus...

   create role my_new_user nosuperuser nocreatedb login
IN ROLE app_readonly_role, app2_writer_role;

That isn't something that I can do using createuser; to do that, I
would need to submit two requests separately:

   PGUSER=postgres createuser -D -S -l my_new_user
   PGUSER=postgres psql -c grant app_readonly_role, app2_writer_role
to my_new_user;

I could certainly change over to using psql to do all the work, but it
would be rather nice if createuser had (say) a -g option which
allowed specifying the set of roles that should be assigned.

Thus, the above commands might be replaced by:
   PGUSER=postgres createuser -D -S -l -g
app_readonly_role,app2_writer_role my_new_user

Would this be worth adding to the ToDo list?
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] backup.sgml-cmd-v003.patch

2013-09-26 Thread Ivan Lezhnjov IV


On Sep 3, 2013, at 6:56 AM, Karl O. Pinc k...@meme.com wrote:

 On 07/31/2013 12:08:12 PM, Ivan Lezhnjov IV wrote:
 
 Patch filename: backup.sgml-cmd-v003.patch
 
 The third version of this patch takes into consideration feedback
 received after original submission (it can be read starting from this
 message http://www.postgresql.org/message-id/CA
 +Tgmoaq-9D_mst113TdW=ar8mpgbc+x6t61azk3emhww9g...@mail.gmail.com)
 
 Essentially, it addresses the points that were raised in community
 feedback and offers better worded statements that avoid implying that
 some features are being deprecated when it isn't the case. We also
 spent some more time polishing other details, like making adjustments
 to the tone of the text so that it sounds more like a manual, and 
 less
 like a blog post. More importantly, this chapter now makes it clear
 that superuser privileges are not always required to perform a
 successful backup because in practice as long as the role used to 
 make
 a backup has sufficient read privileges on all of the objects a user
 is interested in it's going to work just fine. We also mention and
 show examples of usage for pg_restore and pigz alongside with gzip,
 and probably something else too.
 
 Hi Ivan,
 
 I'm reviewing your patch.  I did not read the entirety of the
 thread referenced above.  I apologize if this causes problems.
 
 Attached is backup-sgml-cmnd-v003_1.patch to be applied on top of
 backup-sgml-cmnd-v003.patch and containing my edits.  You will
 eventually want to produce a single patch (but see below).
 Meanwhile this might help you keep track of my changes.
 
 Attached also is your original v3 patch.
 
 ---
 
 Cleaned up and clarified here and there.
 
 The bit about OIDs being depreciated might properly belong in 
 a separate patch.  The same might be said about adding mention of pigz.
 If you submit these as separate patch file attachments
 they can always be applied in a single commit, but the reverse is 
 more work for the committer.  (Regardless, I see no reason to
 have separate commitfest entries or anything other than multiple
 attachments to the email that finalizes our discussion.)

Hello,

took me a while to get here, but a lot has been going on...

Okay, I'm new and I don't know why a single patch like this is more work for a 
commiter? Just so I understand and know.

 
 Minimally modified to note the existence of directory dumps.  It may
 be that the utility/flexibility of directory dumps should also be
 mentioned.
 
 My thought is that the part beginning with The options in detail
 are: should not describe all the possibilities for the --format
 option, that being better left to the reference section.  Likewise,
 this being prose, it might be best to describe all the options
 in-line, instead of presented as a list.  I have left it as-is
 for you to improve as seen fit.

Agreed, it probably looks better as a sentence.

 
 I have frobbed your programlisting to adjust the indentation and
 line-wrap style.  I submit it here for consideration in case this
 style is attractive.  This is nothing but conceit.  We should use the
 same style used elsewhere in the documentation.  (I can't think
 offhand of a place to look for a line-wrapped shell example.  If you
 can't find one I'll take a look and if neither of us finds one we'll
 then have choices.)

Looks good to me.

 
 I don't know that it's necessary to include pigz examples, because it
 sounds like pigz is a drop-in gzip replacement.  I've left your
 examples in, in case you feel they are necessary.

We do. We believe it can encourage more people to consider using it. The way we 
see it, most people seem to be running mutlicore systems these days, yet many 
simply are not aware of pigz. We have been routinely informing our customers of 
pigz as an alternative to tried and tested gzip when helping optimize their 
configurations, and all of them without exception went for it. I guess 
everybody just likes to squeeze some extra juice for free.


 
 The existing text of the SQL Dump section could use some alteration to
 reduce redundancy and add clarity.  I'm thinking specifically of
 mention of pg_restore as being required to restore custom format
 backups and of the default pg_dump output being not just plain text
 but being a collection of SQL commands.  Yes, the latter is obvious
 upon reflection, psql being what it is, but I think it would be
 helpful to spell this out.  Especially in the context of the current
 patch.  There could well be other areas like this to be addressed.

I don't quite follow you here. I mean, I kinda understand what you mean in 
general, but when I look at the text I fail to see what you had in mind 
specifically.

For example, pg_restore is mentioned only 3 times in section 24.1. Each mention 
seems pretty essential to me. And the text flow is pretty natural.

Also, about plain text format being a collection of SQL commands. The very 
first paragraph of the section 24.1 reads The idea

Re: [HACKERS] Minmax indexes

2013-09-26 Thread Alvaro Herrera

Robert Haas escribió:
 On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  Here's an updated version of this patch, with fixes to all the bugs
  reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
  Amit Kapila for the reports.
 
 I'm not very happy with the use of a separate relation fork for
 storing this data.

I understand this opinion, as I considered it myself while developing
it.  Also, GIN already does things this way.  Perhaps I should just bite
the bullet and do this.

 Using an existing fork number rather than creating
 a new one avoids some of them (like, the fact that we loop over all
 known fork numbers in various places, and adding another one will add
 latency in all of those places, particularly when there is a system
 call in the loop) but not all of them (like, what happens if the index
 is unlogged?  we have provisions to reset the main fork but any others
 are just removed; is that OK?), and it also creates some new ones
 (like, files having misleading names).

All good points.

Index scans will normally access the revmap in sequential fashion; it
would be enough to chain revmap pages, keeping a single block number in
the metapage pointing to the first one, and subsequent ones are accessed
from a next block number in each page.  However, heap insertion might
need to access a random revmap page, and this would be too slow.  I
think it would be enough to keep an array of block numbers in the
index's metapage; the metapage would be share locked on every scan and
insert, but that's not a big deal because exclusive lock would only be
needed on the metapage to extend the revmap, which would be a very
infrequent operation.

As this will require some rework to this code, I think it's fair to mark
this as returned with feedback for the time being.  I will return with
an updated version soon, fixing the relation fork issue as well as the
locking and visibility bugs reported by Erik.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Minmax indexes

2013-09-26 Thread Alvaro Herrera

Erik Rijkers wrote:

 I have the impression it's not quite working correctly.
 
 The attached program returns different results for different values of 
 enable_bitmapscan (consistently).

Clearly there's some bug somewhere.  I'll investigate it more.

 ( Btw, I had to make the max_locks_per_transaction higher for even 
 not-so-large tables -- is that expected?  For a 100M row
 table, max_locks_per_transaction=1024 was not enough; I set it to 2048.  
 Might be worth some documentation, eventually. )

Not documentation -- that would also be a bug which needs to be fixed.

Thanks for testing.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] dynamic shared memory

2013-09-26 Thread Jim Nasby


On 9/26/13 8:27 AM, Noah Misch wrote:

On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote:

On Fri, Sep 20, 2013 at 7:44 AM, Andres Freundand...@2ndquadrant.com  wrote:

 Actually, I was wondering if we ought
 to have a directory under pgdata whose explicit charter it was to
 contain files that shouldn't be copied as part of a base backup.
 pg_do_not_back_this_up.

 
 Wondered exactly about that as soon as you've mentioned
 pg_basebackup. pg_local/?


That seems reasonable.  It's not totally transparent what that's
supposed to mean, but it's fairly mnemonic once you know.  Other
suggestions welcome, if anyone has ideas.

I like the concept and have no improvements on the name.


Are there any other likely candidates for inclusion in that directory
other than this stuff?

pg_xlog


Isn't it also pointless to backup temp objects as well as non-logged tables?

Or is the purpose of pg_local to be a home for things that MUST NOT be backed 
up as opposed to items where backing them up is pointless?
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] logical changeset generation v6

2013-09-26 Thread Steve Singer

On 09/25/2013 01:20 PM, Steve Singer wrote:

On 09/25/2013 11:08 AM, Andres Freund wrote:

On 2013-09-25 11:01:44 -0400, Steve Singer wrote:

On 09/17/2013 10:31 AM, Andres Freund wrote:
This patch set now fails to apply because of the commit Rename
various

freeze multixact variables.
And I am even partially guilty for that patch...

Rebased patches attached.
While testing the logical replication changes against my WIP logical
slony I

am sometimes getting error messages from the WAL sender of the form:
unexpected duplicate for tablespace X relfilenode X

Any chance you could provide a setup to reproduce the error?

The steps to build a setup that should reproduce this error are:

1. I had apply the attached patch on top of your logical replication
branch so my pg_decode_init would now if it was being called as part
of a INIT_REPLICATION or START_REPLICATION.
Unless I have misunderstood something you probably will want to merge
this fix in

2. Get my WIP for adding logical support to slony from:
g...@github.com:ssinger/slony1-engine.git branch logical_repl
(4af1917f8418a)
(My code changes to slony are more prototype level code quality than
production code quality)

3.
cd slony1-engine
./configure --with-pgconfigdir=/usr/local/pg94wal/bin (or whatever)
make
make install

4. Grab the clustertest framework JAR from
https://github.com/clustertest/clustertest-framework and build up a
clustertest jar file

5. Create a file
slony1-engine/clustertest/conf/java.conf
that contains the path to the above JAR file as a shell variable
assignment: ie
CLUSTERTESTJAR=/home/ssinger/src/clustertest/clustertest_git/build/jar/clustertest-coordinator.jar

6.
cp clustertest/conf/disorder.properties.sample
clustertest/conf/disorder.properties

edit disorder.properites to have the proper values for your
environment. All 6 databases can point at the same postgres instance,
this test will only actually use 2 of them(so far).

7. Run the test
cd clustertest
./run_all_disorder_tests.sh

This involves having the slon connect to the walsender on the database
test1 and replicate the data into test2 (which is a different database
on the same postmaster)

If this setup seems like too much effort I can request one of the
commitfest VM's from Josh and get everything setup there for you.

Steve

Any ideas?

I'll look into it. Could you provide any context to what youre doing
that's being decoded?

I've determined that when in this test the walsender seems to be hitting
this when it is decode the transactions that are behind the slonik
commands to add tables to replication (set add table, set add
sequence). This is before the SUBSCRIBE SET is submitted.

I've also noticed something else that is strange (but might be
unrelated). If I stop my slon process and restart it I get messages like:

WARNING: Starting logical replication from 0/a9321360
ERROR: cannot stream from 0/A9321360, minimum is 0/A9320B00

Where 0/A9321360 was sent in the last packet my slon received from the
walsender before the restart.

If force it to restart replication from 0/A9320B00 I see datarows that I
appear to have already seen before the restart.
I think this is happening when I process the data for 0/A9320B00 but
don't get the feedback message my slon was killed. Is this expected?

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pgbench progress report improvements - split 3 v2 - A

2013-09-26 Thread Fabien COELHO




 Patch (4): Redefine latency as reported by pgbench and report lag more.


Here is a first partial patch, which focusses on measuring latency and 
reporting the measure under --progress.


**

Improve pgbench measurements  progress report

Measure transaction latency instead under --rate and --progress.

The previous computed figure does not make sense under throttling:
as sleep throttling time was included in the figures, the displayed
results were plain false.

The latency and its standard deviation are printed out under progress and 
in the final report when available.


It could be made always available, but that would require to accept 
additional gettimeofday calls. I do not think that there is a performance 
issue here, but there is significant opposition to the idea.


Sample output under benchmarking with --progress=1

  progress: 1.0 s, 2626.1 tps, 0.374 stddev 0.597 ms lat
  progress: 2.0 s, 2766.6 tps, 0.358 stddev 0.588 ms lat
  progress: 3.0 s, 2567.4 tps, 0.385 stddev 0.665 ms lat
  progress: 4.0 s, 3014.2 tps, 0.328 stddev 0.593 ms lat
  progress: 5.0 s, 2959.3 tps, 0.334 stddev 0.553 ms lat
  ...
  progress: 16.0 s, 5009.2 tps, 0.197 stddev 0.381 ms lat
  ...
  progress: 24.0 s, 7051.2 tps, 0.139 stddev 0.284 ms lat
  ...
  progress: 50.0 s, 6860.5 tps, 0.143 stddev 0.052 ms lat
  ...

The warmup is quite fast because the DB is on a SSD. In the beginning
the standard deviation is well over the average transaction time, but
when the steady state is reached (later) it is much below.

--
Fabien.diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 06dd709..a48498a 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -172,6 +172,7 @@ int			agg_interval;		/* log aggregates instead of individual
  * transactions */
 int			progress = 0;   /* thread progress report every this seconds */
 int progress_nclients = 0; /* number of clients for progress report */
+int			progress_nthreads = 0; /* number of threads for progress report */
 bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
@@ -214,6 +215,8 @@ typedef struct
 	int			nvariables;
 	instr_time	txn_begin;		/* used for measuring transaction latencies */
 	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	int64		txn_latencies;	/* cumulated latencies */
+	int64		txn_sqlats;		/* cumulated square latencies */
 	bool		is_throttled;	/* whether transaction throttling is done */
 	int			use_file;		/* index in sql_files for this client */
 	bool		prepared[MAX_FILES];
@@ -243,6 +246,8 @@ typedef struct
 {
 	instr_time	conn_time;
 	int			xacts;
+	int64		latencies;
+	int64		sqlats;
 	int64   throttle_lag;
 	int64   throttle_lag_max;
 } TResult;
@@ -1003,6 +1008,21 @@ top:
 		}
 
 		/*
+		 * always record latency under progress or throttling
+		 */
+		if ((progress || throttle_delay)  commands[st-state + 1] == NULL)
+		{
+			instr_time	now, diff;
+			int64 latency;
+			INSTR_TIME_SET_CURRENT(now);
+			diff = now;
+			INSTR_TIME_SUBTRACT(diff, st-txn_begin);
+			latency = INSTR_TIME_GET_MICROSEC(diff);
+			st-txn_latencies += latency;
+			st-txn_sqlats += latency*latency;
+		}
+
+		/*
 		 * if transaction finished, record the time it took in the log
 		 */
 		if (logfile  commands[st-state + 1] == NULL)
@@ -1191,8 +1211,8 @@ top:
 		goto top;
 	}
 
-	/* Record transaction start time if logging is enabled */
-	if (logfile  st-state == 0)
+	/* Record transaction start time under logging, progress or throttling */
+	if ((logfile || progress || throttle_delay)  st-state == 0)
 		INSTR_TIME_SET_CURRENT(st-txn_begin);
 
 	/* Record statement start time if per-command latencies are requested */
@@ -2096,6 +2116,7 @@ static void
 printResults(int ttype, int normal_xacts, int nclients,
 			 TState *threads, int nthreads,
 			 instr_time total_time, instr_time conn_total_time,
+			 int64 total_latencies, int64 total_sqlats,
 			 int64 throttle_lag, int64 throttle_lag_max)
 {
 	double		time_include,
@@ -2135,6 +2156,22 @@ printResults(int ttype, int normal_xacts, int nclients,
 			   normal_xacts);
 	}
 
+	if (throttle_delay || progress)
+	{
+		/* compute and show latency average and standard deviation */
+		double latency = 0.001 * total_latencies / normal_xacts;
+		double sqlat = (double) total_sqlats / normal_xacts;
+		printf(latency average: %.3f ms\n
+			   latency stddev: %.3f ms\n,
+			   latency, 0.001 * sqrt(sqlat - 100*latency*latency));
+	}
+	else
+ 	{
+		/* only an average latency computed from the duration is available */
+		printf(latency average: %.3f ms\n,
+			   1000.0 * duration / normal_xacts);
+	}
+
 	if (throttle_delay)
 	{
 		/*
@@ -2143,7 +2180,7 @@ printResults(int ttype, int normal_xacts, int nclients,
 		 * transaction.  The measured lag may be caused by thread/client load,
 		 * the database load, or the Poisson

Re: [HACKERS] Minmax indexes

2013-09-26 Thread Jim Nasby


On 9/26/13 12:00 PM, Robert Haas wrote:

More generally, I fear we really opened a bag of worms with this
relation fork stuff.  Every time I turn around I run into a problem
that could be solved by adding another relation fork.  I'm not
terribly sure that it was a good idea to go that way to begin with,
because we've got customers who are unhappy about 3 files/heap due to
inode consumption and slow directory lookups.  I think we would have
been smarter to devise a strategy for storing the fsm and vm pages
within the main fork in some fashion, and I tend to think that's the
right solution here as well.  Of course, it may be hopeless to put the
worms back in the can at this point, and surely these indexes will be
lightly used compared to heaps, so it's not incrementally exacerbating
the problems all that much.  But I still feel uneasy about widening
use of that mechanism.


Why would we add additional code complexity when forks do the trick? That seems 
like a step backwards, not forward.

If the only complaint about forks is directory traversal why wouldn't we go 
with the well established practice of using multiple directories instead of 
glomming everything into one place?
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

On Thu, Sep 26, 2013 at 4:43 AM, Bruce Momjian br...@momjian.us wrote:
 So, I guess my question is if we are only bloating on a contended
 operation, do we expect that to happen so much that bloat is a problem?

Maybe I could have done a better job of explaining the nature of my
concerns around bloat.

I am specifically concerned about bloat and the clean-up of bloat that
occurs between (or during) value locking and eventual row locking,
because of the necessarily opportunistic nature of the way we go from
one to the other. Bloat, and the obligation to clean it up
synchronously, make row lock conflicts more likely. Conflicts make
bloat more likely, because a conflict implies that another iteration,
complete with more bloat, is necessary.

When you consider that the feature will frequently be used with the
assumption that updating is a much more likely outcome, it becomes
clear that we need to be careful about this sort of interplay.

Having said all that, I would have no objection to some reasonable,
bound amount of bloat occurring elsewhere if that made sense. For
example, I'd certainly be happy to consider the question of whether or
not it's worth doing a kind of speculative heap insertion before
acquiring value locks, because that doesn't need to happen again and
again in the same, critical place, in the interim between value
locking and row locking. The advantage of doing that particular thing
would be to reduce the duration that value locks are held - the
disadvantages would be the *usual* disadvantages of bloat. However,
this is obviously a premature discussion to have now, because the
eventual exact nature of value locks are not known.

 I think the big objection to the patch is the additional code complexity
 and the potential to slow down other sessions.  If it is only bloating
 on a contended operation, are these two downsides worth avoiding the
 bloat?

I believe that all other schemes proposed have some degree of bloat
even in the uncontended case, because they optimistically assume than
an insert will occur, when in general an update is perhaps just as
likely, and will bloat just the same. So, as I've said before,
definition of uncontended is important here.

There is no reason to assume that alternative proposals will affect
concurrency any less than my proposal - the buffer locking thing
certainly isn't essential to my design. You need to weigh things like
WAL-logging multiples times, that other proposals have. You're right
to say that all of this is complex, but I really think that quite
apart from anything else, my design is simpler than others. For
example, the design that Robert sketched would introduce a fairly
considerable modularity violation, per my recent remarks to him, and
actually plastering over that would be a considerable undertaking.
Now, you might counter, but those other designs haven't been worked
out enough. That's true, but then my efforts to work them out further
by pointing out problems with them haven't gone very far. I have
sincerely tried to see a way to make them work.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

On Tue, Sep 24, 2013 at 10:15 PM, Peter Geoghegan p...@heroku.com wrote:
 Well, I think we can rule out value locks that are held for the
 duration of a transaction right away. That's just not going to fly.

I think I agree with that.  I don't think I remember hearing that proposed.

 If we're really lucky, maybe the value locking stuff can be
 generalized or re-used as part of a btree index insertion buffer
 feature.

Well, that would be nifty.

 Also, I tend to think that we might want to define
 the operation as a REPLACE-type operation with respect to a certain
 set of key columns; and so we'll do the insert-or-update behavior with
 respect only to the index on those columns and let the chips fall
 where they may with respect to any others.  In that case this all
 becomes much less urgent.

 Well, MySQL's REPLACE does zero or more DELETEs followed by an INSERT,
 not try an INSERT, then maybe mark the heap tuple if there's a unique
 index dup and then go UPDATE the conflicting tuple. I mention this
 only because the term REPLACE has a certain baggage, and I feel it's
 important to be careful about such things.

I see.  Well, we could try to mimic their semantics, I suppose.  Those
semantics seem like a POLA violation to me; who would have thought
that a REPLACE could delete multiple tuples?  But what do I know?

 The only way that's going to work is if you say use this unique
 index, which will look pretty gross in DML. That might actually be
 okay with me if we had somewhere to go from there in a future release,
 but I doubt that's the case. Another issue is that I'm not sure that
 this helps Andres much (or rather, clients of the logical changeset
 generation infrastructure that need to do conflict resolution), and
 that matters a lot to me here.

Yeah, it's kind of awful.

 Suppose we define the operation as REPLACE rather than INSERT...ON
 DUPLICATE KEY LOCK FOR UPDATE.  Then we could do something like this:

 1. Try to insert a tuple.  If no unique index conflicts occur, stop.
 2. Note the identity of the conflicting tuple and mark the inserted
 heap tuple dead.
 3. If the conflicting tuple's inserting transaction is still in
 progress, wait for the inserting transaction to end.

 Sure, this is basically what the code does today (apart from marking a
 just-inserted tuple dead).

 4. If the conflicting tuple is dead (e.g. because the inserter
 aborted), start over.

 Start over from where? I presume you mean the index tuple insertion,
 as things are today. Or do you mean the very start?

Yes, that's what I meant.

 5. If the conflicting tuple's key columns no longer match the key
 columns of the REPLACE operation, start over.

 What definition of equality or inequality?

Binary equality, same as we'd use to decide whether an update can be done HOT.

 7. Update the tuple, even though it may be invisible to our snapshot
 (a deliberate MVCC violation!).

 I realize that you just wanted to sketch a design, but offhand I think
 that the basic problem with what you describe is that it isn't
 accepting of the inevitability of there being a disconnect between
 value and row locking. Also, this doesn't fit with any roadmap for
 getting a real upsert,

Well, there are two separate issues here: what to do about MVCC, and
how to do the locking.  From an MVCC perspective, I can think of only
two behaviors when the conflicting tuple is committed but invisible:
roll back, or update it despite it being invisible.  If you're saying
you don't like either of those choices, I couldn't agree more, but I
don't have a third idea.  If you do, I'm all ears.

In terms of how to do the locking, what I'm mostly saying is that we
could try to implement this in a way that invents as few new concepts
as possible.  No promise tuples, no new SLRU, no new page-level bits,
just index tuples and heap tuples and so on.  Ideally, we don't even
change the WAL format, although step 2 might require a new record
type.  To the extent that what I actually described was at variance
with that goal, consider it a defect in my explanation rather than an
intent to vary.  I think there's value in considering such an
implementation because each new thing that we have to introduce in
order to get this feature is a possible reason for it to be rejected -
for modularity reasons, or because it hurts performance elsewhere, or
because it's more code we have to maintain, or whatever.

Now, what I hear you saying is, gee, the performance of that might be
terrible.  I'm not sure that I believe that, but it's possible that
you're right. Much seems to depend on what you think the frequency of
conflicts will be, and perhaps I'm assuming it will be low while
you're assuming a higher value.  Regardless, if the performance of the
sort of implementation I'm talking about would be terrible (under some
agreed-upon definition of what terrible means in this context), then
that's a good argument for not doing it that way.  I'm just not
convinced that's the case.

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

On Thu, Sep 26, 2013 at 3:07 PM, Peter Geoghegan p...@heroku.com wrote:
 When you consider that the feature will frequently be used with the
 assumption that updating is a much more likely outcome, it becomes
 clear that we need to be careful about this sort of interplay.

I think one thing that's pretty clear at this point is that almost any
version of this feature could be optimized for either the insert case
or the update case.  For example, my proposal could be modified to
search for a conflicting tuple first, potentially wasting an index
probes (or multiple index probes, if you want to search for potential
conflicts in multiple indexes) if we're inserting, but winning heavily
in the update case.  As written, it's optimized for the insert case.

In fact, I don't know how to know which of these things we should
optimize for.  I wrote part of the code for an EDB proprietary feature
that can do insert-or-update loads about 6 months ago[1], and we
optimized it for updates.  That was not, however, a matter of
principal; it just turned out to be easier to implement that way.  In
fact, I would have assumed that the insert-mostly case was more
likely, but I think the real answer is that some environments will be
insert-mostly and some will be update-mostly and some will be a mix.

If we really want to squeeze out every last drop of possible
performance, we might need two modes: one that assumes we'll mostly
insert, and another that assumes we'll mostly update.  That seems a
frustrating amount of detail to have to expose to the user; an
implementation that was efficient in both cases would be very
desirable, but I do not have a good idea how to get there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] In case you're wondering, attempting to use that feature to upsert
an invisible tuple will result in the load failing with a unique index
violation.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Minmax indexes

On Thu, Sep 26, 2013 at 2:58 PM, Jim Nasby j...@nasby.net wrote:
 Why would we add additional code complexity when forks do the trick? That
 seems like a step backwards, not forward.

Well, they sorta do the trick, but see e.g. commit
ece01aae479227d9836294b287d872c5a6146a11.  I doubt that's the only
code that's poorly-optimized for multiple forks; IOW, every time
someone adds a new fork, there's a system-wide cost to that, even if
that fork is only used in a tiny percentage of the relations that
exist in the system.

It's tempting to think that we can use the fork mechanism every time
we have multiple logical streams of blocks within a relation and
don't want to figure out a way to multiplex them onto the same
physical file.  However, the reality is that the fork mechanism isn't
up to the job.  I certainly don't want to imply that we shouldn't have
gone in that direction - both the fsm and the vm are huge steps
forward, and we wouldn't have gotten them in 8.4 without that
mechanism.  But they haven't been entirely without their own pain,
too, and that pain is going to grow the more we push in the direction
of relying on forks.

 If the only complaint about forks is directory traversal why wouldn't we go
 with the well established practice of using multiple directories instead of
 glomming everything into one place?

That's not the only complaint about forks - but I support what you're
proposing there anyhow, because it will be helpful to users with lots
of relations regardless of what we do or do not decide to do about
forks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix

2013-09-26 Thread David Rowley

On Fri, Sep 27, 2013 at 4:44 AM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Sep 25, 2013 at 4:46 AM, David Rowley dgrowle...@gmail.com
 wrote:
  Ok, I think I've managed to narrow the performance gap to just about
 nothing
  but noise, though to do this the code is now a bit bigger. I've added a
  series of tests to see if the padding is  0 and if it's not then I'm
 doing
  things the old way.
 
  I've also added a some code which does a fast test to see if it is worth
  while calling the padding processing function. This is just a simple if
 (*p
  = '9'), I'm not completely happy with that as it does look a bit weird,
 but
  to compensate I've added a good comment to explain what it is doing.
 
  Please find attached the new patch ... version v0.5 and also updated
  benchmark results.

 Are you sure this is the right set of benchmark results?  This still
 reflects a 15-18% slowdown AFAICS.


I think I must have forgot to save it before I emailed it...

Here's the correct version.




 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



log_line_prefix_benchmark_stresslog_v0.5.xls
Description: MS-Excel spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] dynamic shared memory

On Thu, Sep 26, 2013 at 2:45 PM, Jim Nasby j...@nasby.net wrote:
 On 9/26/13 8:27 AM, Noah Misch wrote:

 On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote:

 On Fri, Sep 20, 2013 at 7:44 AM, Andres Freundand...@2ndquadrant.com
  wrote:

  Actually, I was wondering if we ought
  to have a directory under pgdata whose explicit charter it was to
  contain files that shouldn't be copied as part of a base backup.
  pg_do_not_back_this_up.

  
  Wondered exactly about that as soon as you've mentioned
  pg_basebackup. pg_local/?

 
 That seems reasonable.  It's not totally transparent what that's
 supposed to mean, but it's fairly mnemonic once you know.  Other
 suggestions welcome, if anyone has ideas.

 I like the concept and have no improvements on the name.

 Are there any other likely candidates for inclusion in that directory
 other than this stuff?

 pg_xlog


 Isn't it also pointless to backup temp objects as well as non-logged tables?

 Or is the purpose of pg_local to be a home for things that MUST NOT be
 backed up as opposed to items where backing them up is pointless?

I don't know.  I found it surprising that Noah suggested including
pg_xlog; that's certainly not node-local state in any meaningful
sense, the way dsm stuff is.  But I guess if pg_basebackup excludes it
it arguably qualifies.  However, it'd be nice to advertise that
pg_local can be cleared when the server is shut down, and you
certainly CaN nOt do that to pg_xlog.  Whee!

I think the way I'd summarize the state of this patch is that everyone
who has looked at it more or less agrees that the big picture is
right, but there are details, which I think everyone admits are pretty
much corner cases, where different people have different ideas about
which way to go and what risks and benefits might be thereby incurred.
 I'll not pretend that my opinions are categorically better than any
others, but of course I like them because they are mine.  Figuring out
just what to do to about those last details seems challenging; no
answer will completely please everyone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls

2013-09-26 Thread Alvaro Herrera

I gave this a quick look.  It took me a while to figure out how to apply
it -- turns out you used the ignore whitespace option to diff, so the
only way to apply it was with patch -p1 --ignore-whitespace.  Please
don't do that.

I ran pgindent over the patched code and there were a number of changes.
I suggest you run that over your local copy before your next submission,
to avoid the next official run to mangle your stuff in unforeseen ways.
For instance, the comment starting with A slight edge case would be
mangled; I suggest enclosing that in /* to avoid the problem.
(TBH that's the only interesting thing, but avoiding that kind of
breakage is worth it IMHO.)

First thing I noticed was the funky bms_initialize thingy.  There was
some controversy upthread about the use of bitmapsets, and it seems you
opted for not using them for the constant case as suggested by Jeff; but
apparently the other comment by Robert about the custom bms initializer
went largely ignored.  I agree with him that this is grotty.  However,
the current API to get partition-local memory is too limited as there's
no way to change to the partition's context; instead you only get the
option to allocate a certain amount of memory and return that.  I think
the easiest way to get around this problem is to create a new
windowapi.h function which returns the MemoryContext for the partition.
Then you can just allocate the BMS in that context.

But how do we ensure that the BMS is allocated in a context?  You'd have
to switch contexts each time you call bms_add_member.  I don't have a
good answer to this.  I used this code in another project:

/*
 * grow the visited bitmapset to the index' current size, to avoid
 * repeated repalloc's
 */
{
BlockNumber lastblock;

lastblock = RelationGetNumberOfBlocks(rel);
visited = bms_add_member(visited, lastblock);
visited = bms_del_member(visited, lastblock);
}

This way, I know the bitmapset already has enough space for all the bits
I need and there will be no further allocation.  But this is also
grotty.  Maybe we should have a new entry point in bitmapset.h, say
bms_grow that ensures you have enough space for that many bits.  Or
perhaps add a MemoryContext member to struct Bitmapset, so that all
allocations occur therein.

I'm not too sure I follow the parse_agg.c changes, but don't you need to
free the clone in the branch that finds a duplicate window spec?  Is
this parent/child relationship thingy a preexisting concept, or are you
just coming up with it?  It seems a bit unfamiliar to me.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] gaussian distribution pgbench

2013-09-26 Thread Peter Eisentraut

On 9/20/13 2:42 AM, KONDO Mitsumasa wrote:
 I create gaussinan distribution pgbench patch that can access records with
 gaussian frequency. And I submit this commit fest.

This patch no longer applies.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-09-26 Thread Michael Paquier

On Thu, Sep 26, 2013 at 8:56 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-09-26 20:47:33 +0900, Michael Paquier wrote:
 On Thu, Sep 26, 2013 at 8:43 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2013-09-26 20:40:40 +0900, Michael Paquier wrote:
  On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
   On 2013-09-26 12:13:30 +0900, Michael Paquier wrote:
2) I don't think the drop algorithm used now is correct. Your
index_concurrent_set_dead() sets both indisvalid = false and 
indislive =
false at the same time. It does so after doing a 
WaitForVirtualLocks() -
but that's not sufficient. Between waiting and setting indisvalid =
false another transaction could start which then would start using 
that
index. Which will not get updated anymore by other concurrent 
backends
because of inislive = false.
You really need to follow index_drop's lead here and first unset
indisvalid then wait till nobody can use the index for querying 
anymore
and only then unset indislive.
  
   Sorry, I do not follow you here. index_concurrent_set_dead calls
   index_set_state_flags that sets indislive and *indisready* to false,
   not indisvalid. The concurrent index never uses indisvalid = true so
   it can never be called by another backend for a read query. The drop
   algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw.
  
   That makes it even worse... You can do the concurrent drop only in the
   following steps:
   1) set indisvalid = false, no future relcache lookups will have it as 
   valid
 
  indisvalid is never set to true for the concurrent index. Swap is done
  with concurrent index having indisvalid = false and former index with
  indisvalid = true. The concurrent index is validated with
  index_validate in a transaction before swap transaction.
 
  Yes. I've described how it *has* to be done, not how it's done.
 
  The current method of going straight to isready = false for the original
  index will result in wrong results because it's not updated anymore
  while it's still being used.

 The index being dropped at the end of process is not the former index,
 but the concurrent index. The index used after REINDEX CONCURRENTLY is
 the old index but with the new relfilenode.

 That's not relevant unless I miss something.

 After phase 4 both indexes are valid (although only the old one is
 flagged as such), but due to the switching of the relfilenodes backends
 could have either of both open, depending on the time they built the
 relcache entry. Right?
 Then you go ahead and mark the old index - which still might be used! -
 as dead in phase 5. Which means other backends might (again, depending
 on the time they have built the relcache entry) not update it
 anymore. In read committed we very well might go ahead and use the index
 with the same plan as before, but with a new snapshot. Which now will
 miss entries.
In this case, doing a call to WaitForOldSnapshots after the swap phase
is enough. It was included in past versions of the patch but removed
in the last 2 versions.

Btw, taking the problem from another viewpoint... This feature has now
3 patches, the 2 first patches doing only code refactoring. Could it
be possible to have a look at those ones first? Straight-forward
things should go first, simplifying the core feature evaluation.

Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Support for REINDEX CONCURRENTLY

On 2013-09-27 05:41:26 +0900, Michael Paquier wrote:
 In this case, doing a call to WaitForOldSnapshots after the swap phase
 is enough. It was included in past versions of the patch but removed
 in the last 2 versions.

I don't think it is. I really, really suggest following the protocol
used by index_drop down to the t and document every *slight* deviation
carefully.
We've had more than one bug in index_drop's concurrent feature.

 Btw, taking the problem from another viewpoint... This feature has now
 3 patches, the 2 first patches doing only code refactoring. Could it
 be possible to have a look at those ones first? Straight-forward
 things should go first, simplifying the core feature evaluation.

I haven't looked at them in detail, but they looked good on a quick
pass. I'll make another pass, but that won't be before, say, Tuesday.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

On Thu, Sep 26, 2013 at 03:33:34PM -0400, Robert Haas wrote:
 On Thu, Sep 26, 2013 at 3:07 PM, Peter Geoghegan p...@heroku.com wrote:
  When you consider that the feature will frequently be used with the
  assumption that updating is a much more likely outcome, it becomes
  clear that we need to be careful about this sort of interplay.
 
 I think one thing that's pretty clear at this point is that almost any
 version of this feature could be optimized for either the insert case
 or the update case.  For example, my proposal could be modified to
 search for a conflicting tuple first, potentially wasting an index
 probes (or multiple index probes, if you want to search for potential
 conflicts in multiple indexes) if we're inserting, but winning heavily
 in the update case.  As written, it's optimized for the insert case.

I assumed the code was going to do the index lookups first without a
lock, and take the appropriate action, insert or update, with fallbacks
for guessing wrong.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix

On Thu, Sep 26, 2013 at 3:55 PM, David Rowley dgrowle...@gmail.com wrote:
 I think I must have forgot to save it before I emailed it...

 Here's the correct version.

Ah ha.  Looks better.

I'm working on getting this committed now.  Aside from some stylistic
things, I've found one serious functional bug, which is that you need
to test padding != 0, not padding  0, when deciding which path to
take.  No need to send a new patch, I've already fixed it in my local
copy...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] record identical operator - Review

On Thu, Sep 19, 2013 at 06:31:38PM -0400, Steve Singer wrote:
 On 09/12/2013 06:27 PM, Kevin Grittner wrote:
 Attached is a patch for a bit of infrastructure I believe to be
 necessary for correct behavior of REFRESH MATERIALIZED VIEW
 CONCURRENTLY as well as incremental maintenance of matviews.
 
 Here is attempt at a review of the v1 patch.
 
 There has been a heated discussion on list about if we even want
 this type of operator. I will try to summarize it here
 
 The arguments against it are
 * that a record_image_identical call on two records that returns
 false can still return true under the equality operator, and the
 records might (or might not) appear to be the same to users
 * Once we expose this as an operator via SQL someone will find it
 and use it and then complain when we change things such as the
 internal representation of a datatype which might
break there queries
 * The differences between = and record_image_identical might not be
 understood by everywhere who finds the operator leading to confusion
 * Using a criteria other than equality (the = operator provided by
 the datatype) might cause problems if we later provide 'on change'
 triggers for materialized views
 
 The arguments for this patch are
 * We want the materialized view to return the same value as would be
 returned if the query were executed directly.  This not only means
 that the values should be the same according to a datatypes =
 operator but that they should appear the same 'to the eyeball'.
 * Supporting the materialized view refresh check via SQL makes a lot
 of sense but doing that requires exposing something via SQL
 * If we adequately document what we mean by record_image_identical
 and the operator we pick for this then users shouldn't be surprised
 at what they get if they use this

This is a good summary.  I think there are a few things that make this
issue difficult to decide:

1.  We have to use an operator to give the RMVC (REFRESH MATERIALIZED
VIEW CONCURRENTLY) SPI query the ability to optimize this query.   If we
could do this with an SQL or C function, there would be less concern
about the confusion caused by adding this capability.

Question:  If we are comparing based on some primary key, why do we need
this to optimize?  Certainly the primary key index will be used, no?

2.  Agregates are already non-deterministic, so some feel that adding
this feature doesn't improve much.  The counter-argument is that without
the binary row comparison ability, rows could be returned that could
_never_ have been produced by the base data, which is more of a user
surprise than non-deterministic rows.

3.  Our type casting and operators are already complex, and adding
another set of operators only compounds that.

4.  There are legitimate cases where tool makers might want the ability
to compare rows binarily, e.g. for replication, and adding these
operators would help with that.

I think we need to see a patch from Kevin that shows all the row
comparisons documented and we can see the impact of the user-visibile
part.

One interesting approach would be to only allow the operator to be
called from SPI queries.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH] pg_upgrade: Split off pg_fatal() from pg_log()

On Thu, Sep 12, 2013 at 10:50:42PM -0400, Peter Eisentraut wrote:
 The experiences with elog() and ereport() have shown that having one
 function that can return or not depending on some log level parameter
 isn't a good idea when you want to communicate well with the compiler.
 In pg_upgrade, there is a similar case with the pg_log() function.
 Since that isn't a public API, I'm proposing to change it and introduce
 a separate function pg_fatal() for those cases where the calls don't
 return.

Fine with me.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GIN improvements part 1: additional information

2013-09-26 Thread Alexander Korotkov

On Wed, Sep 25, 2013 at 5:22 PM, Peter Eisentraut pete...@gmx.net wrote:

 On 9/23/13 5:36 PM, Alexander Korotkov wrote:
  In the attached version of patch double finding of ItemPointer during
  insert is avoided. Overhead becomes lower as expected.

 Fails cpluspluscheck:

 ./src/include/access/gin_private.h: In function ‘char*
 ginDataPageLeafReadItemPointer(char*, ItemPointer)’:
 ./src/include/access/gin_private.h:797:2: warning: comparison between
 signed and unsigned integer expressions [-Wsign-compare]


Thanks. Fixed in attached version of patch.

--
With best regards,
Alexander Korotkov.


gin-packed-postinglists-7.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix

On Thu, Sep 26, 2013 at 5:18 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 26, 2013 at 3:55 PM, David Rowley dgrowle...@gmail.com wrote:
 I think I must have forgot to save it before I emailed it...

 Here's the correct version.

 Ah ha.  Looks better.

 I'm working on getting this committed now.  Aside from some stylistic
 things, I've found one serious functional bug, which is that you need
 to test padding != 0, not padding  0, when deciding which path to
 take.  No need to send a new patch, I've already fixed it in my local
 copy...

Committed now.  Let me know if I broke anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Wait free LW_SHARED acquisition

Hello,

We have had several customers running postgres on bigger machines report
problems on busy systems. Most recently one where a fully cached
workload completely stalled in s_lock()s due to the *shared* lwlock
acquisition in BufferAlloc() around the buffer partition lock.

Increasing the padding to a full cacheline helps making the partitioning
of the partition space actually effective (before it's essentially
halved on a read-mostly workload), but that still leaves one with very
hot spinlocks.

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065

That's rougly 400%.

So, nice little improvement. Unless - not entirely unlikely - I fucked
up and it's fast because it's broken.

Anyway, here's the algorith I chose to implement:
The basic idea is to have a single 'uint32 lockcount' instead of the
former 'char exclusive' and 'int shared' and to use an atomic increment
to acquire the lock. That's fairly easy to do for rw-spinlocks, but a
lot harder for something like LWLocks that want to wait in the OS.

Exlusive lock acquisition:

Use an atomic compare-and-exchange on the lockcount variable swapping in
EXCLUSIVE_LOCK/131/0x8000 if and only if the current value of
lockcount is 0. If the swap was not successfull, we have to wait.


Shared lock acquisition:

Use an atomic add (lock xadd) to the lockcount variable to add 1. If the
value is bigger than EXCLUSIVE_LOCK we know that somebody actually has
an exclusive lock, and we back out by atomically decrementing by 1
again.
If so, we have to wait for the exlusive locker to release the lock.

Queueing  Wakeup:

Whenever we don't get a shared/exclusive lock we us nearly the same
queuing mechanism as we currently do. While we probably could make it
lockless as well, the queue currently is still protected by the good old
spinlock.

Relase:

Use a atomic decrement to release the lock. If the new value is zero (we
get that atomically), we know we have to release waiters.

And the real world:

Now, as you probably have noticed, naively doing the above has two
glaring race conditions:

1) too-quick-for-queueing:
We try to lock using the atomic operations and notice that we have to
wait. Unfortunately until we have finished queuing, the former locker
very well might have already finished it's work.

2) spurious failed locks:
Due to the logic of backing out of shared locks after we unconditionally
added a 1 to lockcount, we might have prevented another exclusive locker
from getting the lock:
1) Session A: LWLockAcquire(LW_EXCLUSIVE) - success
2) Session B: LWLockAcquire(LW_SHARED) - lockcount += 1
3) Session B: LWLockAcquire(LW_SHARED) - oops, bigger than EXCLUSIVE_LOCK
4) Session B: LWLockRelease()
5) Session C: LWLockAcquire(LW_EXCLUSIVE) - check if lockcount = 0, no. wait.
6) Session B: LWLockAcquire(LW_SHARED) - lockcount -= 1
7) Session B: LWLockAcquire(LW_SHARED) - wait

So now we can have both B) and C) waiting on a lock that nobody is
holding anymore. Not good.


The solution:
We use a two phased attempt at locking:
Phase 1: Try to do it atomically, if we succeed, nice
Phase 2: Add us too the waitqueue of the lock
Phase 3: Try to grab the lock again, if we succeed, remove ourselves
 from the queue
Phase 4: Sleep till wakeup, goto Phase 1

This protects us against both problems from above:
1) Nobody can release too quick, before we're queued, after Phase 2 since we're 
already
queued.
2) If somebody spuriously got blocked from acquiring the lock, they will
get queued in Phase 2 and we can wake them up if neccessary.

Now, there are lots of tiny details to handle additionally to those, but
those seem better handled by looking at the code?
- The above algorithm only works for LWLockAcquire, not directly for
  LWLockAcquireConditional where we don't want to wait. In that case we
  just need to retry acquiring the lock until we're sure we didn't
  disturb anybody in doing so.
- we can get removed from the queue of waiters in Phase 3, before we remove
  ourselves. In that case we need to absorb the wakeup.
- Spurious locks can prevent us from recognizing a lock that's free
  during release. Solve it by checking for existing waiters whenever an
  exlusive lock is released.

I've done a couple of off-hand benchmarks and so far I can confirm that
everything using lots of shared locks benefits greatly and everything
else doesn't really change much. So far I've seen mostly some slight
improvements in exclusive lock heavy workloads, but those were pretty
small.
It's also very important to mention that those speedups are only there
on multi-socket machines. From what I've benchmarked so far in LW_SHARED
heavy workloads with 1 socket you get ~5-10%, 2 sockets 20-30% and
finally and

Re: [HACKERS] Wait free LW_SHARED acquisition

On Thu, Sep 26, 2013 at 3:55 PM, Andres Freund and...@2ndquadrant.com wrote:
 We have had several customers running postgres on bigger machines report
 problems on busy systems. Most recently one where a fully cached
 workload completely stalled in s_lock()s due to the *shared* lwlock
 acquisition in BufferAlloc() around the buffer partition lock.

That's unfortunate. I saw someone complain about what sounds like
exactly the same issue on Twitter yesterday:

https://twitter.com/Roguelazer/status/382706273927446528

I tried to engage with him, but was unsuccessful.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Wait free LW_SHARED acquisition

On 2013-09-26 16:56:30 -0700, Peter Geoghegan wrote:
 On Thu, Sep 26, 2013 at 3:55 PM, Andres Freund and...@2ndquadrant.com wrote:
  We have had several customers running postgres on bigger machines report
  problems on busy systems. Most recently one where a fully cached
  workload completely stalled in s_lock()s due to the *shared* lwlock
  acquisition in BufferAlloc() around the buffer partition lock.
 
 That's unfortunate. I saw someone complain about what sounds like
 exactly the same issue on Twitter yesterday:

Well, fortunately there's a solution, as presented here ;)

There's another bottleneck in the heaps of PinBuffer() calls in such
workloads, that present themselves after fixing the lwlock contention,
at least in my tests. I think I see a solution there, but let's fix this
first though ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] record identical operator - Review

On Thu, Sep 26, 2013 at 05:50:08PM -0400, Bruce Momjian wrote:
 I think we need to see a patch from Kevin that shows all the row
 comparisons documented and we can see the impact of the user-visibile
 part.
 
 One interesting approach would be to only allow the operator to be
 called from SPI queries.

It would also be good to know about similar non-default entries in
pg_opclass so we can understand the expected impact.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH] bitmap indexes

2013-09-26 Thread Abhijit Menon-Sen

At 2013-09-26 08:39:05 -0700, jeff.ja...@gmail.com wrote:

 I don't know about grottiness, but it certainly seems like it would
 deadlock like crazy.

Hi Jeff.

I don't understand the deadlock scenario you're thinking of. Could you
explain, please?

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] backup.sgml-cmd-v003.patch

2013-09-26 Thread Karl O. Pinc

On 09/26/2013 12:15:25 PM, Ivan Lezhnjov IV wrote:
 
 On Sep 3, 2013, at 6:56 AM, Karl O. Pinc k...@meme.com wrote:
 
  On 07/31/2013 12:08:12 PM, Ivan Lezhnjov IV wrote:
  
  Patch filename: backup.sgml-cmd-v003.patch
  
  The third version of this patch takes into consideration feedback
  received after original submission (it can be read starting from
 this
  message http://www.postgresql.org/message-id/CA
  +Tgmoaq-9D_mst113TdW=ar8mpgbc+x6t61azk3emhww9g...@mail.gmail.com)
  
  Essentially, it addresses the points that were raised in community
  feedback and offers better worded statements that avoid implying
 that
  some features are being deprecated when it isn't the case. We also
  spent some more time polishing other details, like making
 adjustments
  to the tone of the text so that it sounds more like a manual, and 
  less
  like a blog post. More importantly, this chapter now makes it 
 clear
  that superuser privileges are not always required to perform a
  successful backup because in practice as long as the role used to 
  make
  a backup has sufficient read privileges on all of the objects a
 user
  is interested in it's going to work just fine. We also mention and
  show examples of usage for pg_restore and pigz alongside with 
 gzip,
  and probably something else too.

  ---
  
  Cleaned up and clarified here and there.
  
  The bit about OIDs being depreciated might properly belong in 
  a separate patch.  The same might be said about adding mention of
 pigz.
  If you submit these as separate patch file attachments
  they can always be applied in a single commit, but the reverse is 
  more work for the committer.  (Regardless, I see no reason to
  have separate commitfest entries or anything other than multiple
  attachments to the email that finalizes our discussion.)
 
 Hello,
 
 took me a while to get here, but a lot has been going on...

No worries.
 
 Okay, I'm new and I don't know why a single patch like this is more
 work for a commiter? Just so I understand and know.

Different committers have different preferences but the general rule
is that it's work to split a patch into pieces if you don't like the
whole thing but it's easy to apply a bunch of small patches and
commit them all at once.  Further each commit should represent
a single feature or conceptual change.  Again, preferences vary
but I like to think that a good rule is that 1 commit should
be able to be described in a sentence, and not a run-on sentence
either that says I did this and I also did that and something else.
So if there's a question in your mind about whether a committer
will want your entire change, or if your patch changes unrelated
things then it does not hurt to submit it as separate patches.
All the patches can be attached to a single email and part of
a single commitfest tracking entry, usually.  No need to get
crazy.  These are just things to think about.

In your case I see 3 things happening:

oid depreciation

custom format explanation

pigz promotion


  My thought is that the part beginning with The options in detail
  are: should not describe all the possibilities for the --format
  option, that being better left to the reference section.  Likewise,
  this being prose, it might be best to describe all the options
  in-line, instead of presented as a list.  I have left it as-is
  for you to improve as seen fit.
 
 Agreed, it probably looks better as a sentence.

Looks good.

 
  
  I have frobbed your programlisting to adjust the indentation and
  line-wrap style.  I submit it here for consideration in case this
  style is attractive.  This is nothing but conceit.  We should use
 the
  same style used elsewhere in the documentation.  

 Looks good to me.

I fixed the missing \ I messed up on last time
and slightly re-worded the previous sentence.

I've grep-ed through the sgml looking for multi-line shell scripts
and found only 1 (in sepgsql.sgm).  I don't see a conflict with
the formatting/line-break convention used in the patch, although
it does differ slightly in indentation.  I'm leaving the shell
script formatting in the patch as-is for the committer to judge.
(I like the way it looks but it is not a traditional style.)

 
  
  I don't know that it's necessary to include pigz examples, because
 it
  sounds like pigz is a drop-in gzip replacement.  I've left your
  examples in, in case you feel they are necessary.
 
 We do. We believe it can encourage more people to consider using it.
 The way we see it, most people seem to be running mutlicore systems
 these days, yet many simply are not aware of pigz.

Ok.  It's your patch.

  
  The existing text of the SQL Dump section could use some alteration
 to
  reduce redundancy and add clarity.  I'm thinking specifically of
  mention of pg_restore as being required to restore custom format
  backups and of the default pg_dump output being not just plain
 text
  but being a collection of SQL commands.  Yes, the latter is obvious
  upon reflection,

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, I think we can rule out value locks that are held for the
 duration of a transaction right away. That's just not going to fly.

 I think I agree with that.  I don't think I remember hearing that proposed.

I think I might have been unclear - I mean locks that are held for the
duration of *another* transaction, not our own, as we wait for that
other transaction to commit/abort. I think that earlier remarks from
yourself and Andres implied that this would be necessary. Perhaps I'm
mistaken. Your most recent design proposal doesn't do this, but I
think that that's only because it restricts the user to a single
unique index - it would otherwise be necessary to sit on the earlier
value locks (index tuples belonging to an unfinished transaction)
pending the completion of some other conflicting transaction, which
has numerous disadvantages (as described in my it suits my purposes
to have the value locks be held for only an instant mail to you [1]).

 If we're really lucky, maybe the value locking stuff can be
 generalized or re-used as part of a btree index insertion buffer
 feature.

 Well, that would be nifty.

Yes, it would. I think, based on a conversation with Rob Wultsch, that
it's another area where MySQL still do quite a bit better.

 I see.  Well, we could try to mimic their semantics, I suppose.  Those
 semantics seem like a POLA violation to me; who would have thought
 that a REPLACE could delete multiple tuples?  But what do I know?

I think that it's fairly widely acknowledged to not be very good.
Every MySQL user uses INSERT...ON DUPLICATE KEY UPDATE instead.

 The only way that's going to work is if you say use this unique
 index, which will look pretty gross in DML.

 Yeah, it's kind of awful.

It is.

 What definition of equality or inequality?

 Binary equality, same as we'd use to decide whether an update can be done HOT.

I guess that's acceptable in theory, because binary equality is
necessarily a *stricter* condition than equality according to some
operator that is an equivalence relation. But the fact remains that
you're just ameliorating the problem by making it happen less often
(both through this kind of trick, but also by restricting us to one
unique index), not actually fixing it.

 Well, there are two separate issues here: what to do about MVCC, and
 how to do the locking.

Totally agreed. Fortunately, unlike the different aspects of value and
row locking, I think that these two questions can be reasonable
considered independently.

 From an MVCC perspective, I can think of only
 two behaviors when the conflicting tuple is committed but invisible:
 roll back, or update it despite it being invisible.  If you're saying
 you don't like either of those choices, I couldn't agree more, but I
 don't have a third idea.  If you do, I'm all ears.

I don't have another idea either. In fact, I'd go so far as to say
that doing any third thing that's better than those two to any
reasonable person is obviously impossible. But I'd add that we simple
cannot rollback at read committed, so we're just going to have to hold
our collective noses and do strange things with visibility.

FWIW, I'm tentatively looking at doing something like this:

*** HeapTupleSatisfiesMVCC(HeapTuple htup, S
*** 958,963 
--- 959,975 
  * By here, the inserting transaction has committed - have to check
  * when...
  */
+
+ /*
+ * Not necessarily visible to snapshot under conventional MVCC rules, but
+ * still locked by our xact and not updated -- importantly, normal MVCC
+ * semantics apply when we update the row, so only one version will be
+ * visible at once
+ */
+ if (HEAP_XMAX_IS_LOCKED_ONLY(tuple-t_infomask) 
+ TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
+ return true;
+
  if (XidInMVCCSnapshot(HeapTupleHeaderGetXmin(tuple), snapshot))
  return false; /* treat as still in progress */

This is something that I haven't given remotely enough thought yet, so
please take it with a big grain of salt.

 In terms of how to do the locking, what I'm mostly saying is that we
 could try to implement this in a way that invents as few new concepts
 as possible.  No promise tuples, no new SLRU, no new page-level bits,
 just index tuples and heap tuples and so on.  Ideally, we don't even
 change the WAL format, although step 2 might require a new record
 type.  To the extent that what I actually described was at variance
 with that goal, consider it a defect in my explanation rather than an
 intent to vary.  I think there's value in considering such an
 implementation because each new thing that we have to introduce in
 order to get this feature is a possible reason for it to be rejected -
 for modularity reasons, or because it hurts performance elsewhere, or
 because it's more code we have to maintain, or whatever.

There is certainly value in considering that, and you're right to take
that tact - it is

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE