Re: [HACKERS] Escaping from blocked send() reprised.

2014-10-02 Thread Kyotaro HORIGUCHI
  Sorry, I missed this message and only caught up when reading your CF
  status mail. I've attached three patches:
 
  Could you let me know how to get the CF status mail?
 
 I think he meant this email I sent last weekend:
 
 http://www.postgresql.org/message-id/542672d2.3060...@vmware.com

I see, that's what I also received. Thank you.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Typo fixes in src/backend/rewrite/rewriteHandler.c

2014-10-02 Thread Etsuro Fujita
Here are the comments in process_matched_tle() in rewriteHandler.c.

883  * such nodes; consider
884  *  UPDATE tab SET col.fld1.subfld1 = x, col.fld2.subfld2 = y
885  * The two expressions produced by the parser will look like
886  *  FieldStore(col, fld1, FieldStore(placeholder, subfld1, x))
887  *  FieldStore(col, fld2, FieldStore(placeholder, subfld2, x))

I think the second one is not correct and should be

FieldStore(col, fld2, FieldStore(placeholder, subfld2, y))

Likewise,

891  *  FieldStore(FieldStore(col, fld1,
892  *FieldStore(placeholder, subfld1, x)),
893  * fld2, FieldStore(placeholder, subfld2, x))

should be

FieldStore(FieldStore(col, fld1,
  FieldStore(placeholder, subfld1, x)),
   fld2, FieldStore(placeholder, subfld2, y))

Patch attached.

Thanks,

Best regards,
Etsuro Fujita
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index cb65c05..93fda07 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -883,13 +883,13 @@ process_matched_tle(TargetEntry *src_tle,
 	 *		UPDATE tab SET col.fld1.subfld1 = x, col.fld2.subfld2 = y
 	 * The two expressions produced by the parser will look like
 	 *		FieldStore(col, fld1, FieldStore(placeholder, subfld1, x))
-	 *		FieldStore(col, fld2, FieldStore(placeholder, subfld2, x))
+	 *		FieldStore(col, fld2, FieldStore(placeholder, subfld2, y))
 	 * However, we can ignore the substructure and just consider the top
 	 * FieldStore or ArrayRef from each assignment, because it works to
 	 * combine these as
 	 *		FieldStore(FieldStore(col, fld1,
 	 *			  FieldStore(placeholder, subfld1, x)),
-	 *   fld2, FieldStore(placeholder, subfld2, x))
+	 *   fld2, FieldStore(placeholder, subfld2, y))
 	 * Note the leftmost expression goes on the inside so that the
 	 * assignments appear to occur left-to-right.
 	 *

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPSERT wiki page, and SQL MERGE syntax

2014-10-02 Thread Heikki Linnakangas

On 10/02/2014 02:52 AM, Peter Geoghegan wrote:

Having been surprisingly successful at advancing our understanding of
arguments for and against various approaches to value locking, I
decided to try the same thing out elsewhere. I have created a
general-purpose UPSERT wiki page.

The page is: https://wiki.postgresql.org/wiki/UPSERT


Thanks!

Could you write down all of the discussed syntaxes, using a notation 
similar to the one we use in the manual, with examples of how to use them? 
And some examples of what is possible with some syntaxes and not with 
others? That would make it a lot easier to compare them.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-10-02 Thread Heikki Linnakangas

On 10/02/2014 03:20 AM, Kevin Grittner wrote:

My only concern from the benchmarks is that it seemed like there
was a statistically significant increase in planning time:

unpatched plan time average: 0.450 ms
patched plan time average:   0.536 ms

That *might* just be noise, but it seems likely to be real.  For
the improvement in run time, I'd put up with an extra 86us in
planning time per hash join; but if there's any way to shave some
of that off, all the better.


The patch doesn't modify the planner at all, so it would be rather 
surprising if it increased planning time. I'm willing to just write that 
off as noise.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-10-02 Thread Tomas Vondra
On 2 October 2014, 2:20, Kevin Grittner wrote:
 Tomas Vondra t...@fuzzy.cz wrote:
 On 12.9.2014 23:22, Robert Haas wrote:

 My first thought is to revert to NTUP_PER_BUCKET=1, but it's
 certainly arguable. Either method, though, figures to be better than
 doing nothing, so let's do something.

 OK, but can we commit the remaining part first? It'll certainly
 interact with this proposed part, and it's easier to tweak when the code
 is already there, instead of rebasing over and over.

 The patch applied and built without problem, and passed `make
 check-world`.  The only thing that caught my eye was the addition
 of debug code using printf() instead of logging at a DEBUG level.
 Is there any reason for that?

Not really. IIRC the main reason is that the other code in nodeHash.c uses
the same approach.

 I still need to work through the (rather voluminous) email threads
 and the code to be totally comfortable with all the issues, but
 if Robert and Heikki are comfortable with it, I'm not gonna argue.

;-)

 Preliminary benchmarks look outstanding!  I tried to control
 carefully to ensure consistent results (by dropping, recreating,
 vacuuming, analyzing, and checkpointing before each test), but
 there were surprising outliers in the unpatched version.  It turned
 out that it was because slight differences in the random samples
 caused different numbers of buckets for both unpatched and patched,
 but the patched version dealt with the difference gracefully while
 the unpatched version's performance fluctuated randomly.

Can we get a bit more detail on the test cases? I did a number of tests
while working on the patch, and I generally tested two cases:

(a) well-estimated queries (i.e. with nbucket matching NTUP_PER_BUCKET)

(b) mis-estimated queries, with various levels of accuracy (say, 10x -
1000x misestimates)

From the description, it seems you only tested (b) - is that correct?

The thing is that while the resize is very fast and happens only once,
it's not perfectly free. In my tests, this was always more than
compensated by the speedups (even in the weird cases reported by Stephen
Frost), so I think we're safe here.

Also, I definitely recommend testing with different hash table sizes (say,
work_mem=256MB and the hash table just barely fitting in without batching).
The thing is that the effect of CPU caches is very different for small and
large hash tables. (This is not about work_mem alone, but about how much
memory is used by the hash table; according to the results you posted, it
never gets over ~4.5MB.)


You tested against current HEAD, right? This patch was split into three
parts, two of which were already committed (45f6240a and 8cce08f1). The
logic of the patch was that this part might take some time/memory, but
it's compensated for by these other changes. Maybe running the tests on the
original code would be interesting?

Although, if this last part of the patch actually improves the performance
on its own, we're fine - it'll improve it even more compared to the old
code (especially before lowering NTUP_PER_BUCKET from 10 to 1).
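
To make the misestimation point concrete, here is a rough standalone sketch
(a toy model, not the actual ExecChooseHashTableSize code; the numbers are
made up): the bucket count is derived from the *estimated* row count, so an
underestimate directly translates into long bucket chains unless the hash
table is resized at execution time.

#include <stdio.h>

#define NTUP_PER_BUCKET 1

static long
choose_nbuckets(double estimated_tuples)
{
	long		nbuckets = 1;

	/* round up to a power of two */
	while (nbuckets < estimated_tuples / NTUP_PER_BUCKET)
		nbuckets <<= 1;
	return nbuckets;
}

int
main(void)
{
	double		actual = 10 * 1000 * 1000;	/* rows actually hashed */
	long		nbuckets = choose_nbuckets(actual / 100);	/* 100x underestimate */

	printf("%ld buckets, ~%.0f tuples per bucket chain\n",
		   nbuckets, actual / nbuckets);
	return 0;
}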

 My only concern from the benchmarks is that it seemed like there
 was a statistically significant increase in planning time:

 unpatched plan time average: 0.450 ms
 patched plan time average:   0.536 ms

 That *might* just be noise, but it seems likely to be real.  For
 the improvement in run time, I'd put up with an extra 86us in
 planning time per hash join; but if there's any way to shave some
 of that off, all the better.

I agree with Heikki that this is probably noise, because the patch does
not mess with the planner at all.

The only thing I can think of is adding a few fields into
HashJoinTableData. Maybe this makes the structure too large to fit into a
cacheline, or something?

Tomas




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPSERT wiki page, and SQL MERGE syntax

2014-10-02 Thread Peter Geoghegan
On Thu, Oct 2, 2014 at 12:07 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Could you write down all of the discussed syntaxes, using a notation similar
 to the one we use in the manual, with examples of how to use them? And some
 examples of what is possible with some syntaxes and not with others? That
 would make it a lot easier to compare them.

I've started off by adding varied examples of the use of the existing
proposed syntax. I'll expand on this soon.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Escaping from blocked send() reprised.

2014-10-02 Thread Kyotaro HORIGUCHI
Hello,

  I propose the attached patch. It adds a new flag ImmediateDieOK, which is a
  weaker form of ImmediateInterruptOK that only allows handling a pending
  die-signal in the signal handler.
  
  Robert, others, do you see a problem with this?
 
 Per se I don't have a problem with it. There does exist the problem that
 the user doesn't get an error message in more cases though. On the other
 hand it's bad if any user can prevent the database from restarting.
 
  Over IM, Robert pointed out that it's not safe to jump out of a signal
  handler with siglongjmp, when we're inside library calls, like in a callback
  called by OpenSSL. But even with current master branch, that's exactly what
  we do. In secure_raw_read(), we set ImmediateInterruptOK = true, which means
  that any incoming signal will be handled directly in the signal handler,
  which can mean elog(ERROR). Should we be worried? OpenSSL might get confused
  if control never returns to the SSL_read() or SSL_write() function that
  called secure_raw_read().
 
 But this is imo prohibitive. Yes, we're doing it for a long while. But
 no, that's not ok. It actually prompted me into prototyping the latch
 thing (in some other thread). I don't think existing practice justifies
 expanding it further.

I see; in that case, this approach seems basically
applicable. But if I understand correctly, this patch seems not
to return out of the OpenSSL code even when the latch was found to be
set in secure_raw_write/read.  I tried setting errno = ECONNRESET
and it went well, but it seems like a bad practice.

secure_raw_write(Port *port, const void *ptr, size_t len)
{
	n = send(port->sock, ptr, len, 0);

	if (!port->noblock && n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
	{
		w = WaitLatchOrSocket(MyProc->procLatch, ...

		if (w & WL_LATCH_SET)
		{
			ResetLatch(MyProc->procLatch);
			/*
			 * Force a return, so interrupts can be processed when not
			 * (possibly) underneath a ssl library.
			 */
			errno = EINTR;
			(return n;  // n is negative)


my_sock_write(BIO *h, const char *buf, int size)
{
	res = secure_raw_write(((Port *) h->ptr), buf, size);
	BIO_clear_retry_flags(h);
	if (res <= 0)
	{
		if (errno == EINTR || errno == EWOULDBLOCK || errno == EAGAIN)
		{
			BIO_set_retry_write(h);

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 3

2014-10-02 Thread Heikki Linnakangas

On 09/23/2014 09:24 PM, Andres Freund wrote:

I've previously started two threads about replication identifiers. Check
http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
and
http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
.

They've also been discussed in the course of another thread:
http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de


And even earlier here:
http://www.postgresql.org/message-id/flat/1339586927-13156-10-git-send-email-and...@2ndquadrant.com#1339586927-13156-10-git-send-email-and...@2ndquadrant.com
The thread branched a lot; the relevant branch is the one with the subject 
"[PATCH 10/16] Introduce the concept that wal has a 'origin' node"



== Identify the origin of changes ==

Say you're building a replication solution that allows two nodes to
insert into the same table on two nodes. Ignoring conflict resolution
and similar fun, one needs to prevent the same change being replayed
over and over. In logical replication the changes to the heap have to
be WAL logged, and thus the *replay* of changes from a remote node
produce WAL which then will be decoded again.

To avoid that it's very useful to tag individual changes/transactions
with their 'origin'. I.e. mark changes that have been directly
triggered by the user sending SQL as originating 'locally' and changes
originating from replaying another node's changes as originating
somewhere else.

If that origin is exposed to logical decoding output plugins they can
easily check whether to stream out the changes/transactions or not.


It is possible to do this by adding extra columns to every table and
store the origin of a row in there, but that a) permanently needs
storage b) makes things much more invasive.


An origin column in the table itself helps tremendously to debug issues 
with the replication system. In many if not most scenarios, I think 
you'd want to have that extra column, even if it's not strictly required.



What I've previously suggested (and which works well in BDR) is to add
the internal id to the XLogRecord struct. There's 2 free bytes of
padding that can be used for that purpose.


Adding a field to XLogRecord for this feels wrong. This is for *logical* 
replication - why do you need to mess with something as physical as the 
WAL record format?


And who's to say that a node ID is the most useful piece of information 
for a replication system to add to the WAL header. I can easily imagine 
that you'd want to put a changeset ID or something else in there, 
instead. (I mentioned another example of this in 
http://www.postgresql.org/message-id/4fe17043.60...@enterprisedb.com)


If we need additional information added to WAL records, for extensions, 
then that should be made in an extensible fashion. IIRC (I couldn't find 
a link right now), when we discussed the changes to heap_insert et al 
for wal_level=logical, I already argued back then that we should make it 
possible for extensions to annotate WAL records, with things like "this 
is the primary key", or whatever information is needed for conflict 
resolution, or handling loops. I don't like it that we're adding little 
pieces of information to the WAL format, bit by bit.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Escaping from blocked send() reprised.

2014-10-02 Thread Andres Freund
On 2014-10-02 17:47:39 +0900, Kyotaro HORIGUCHI wrote:
 Hello,
 
   I propose the attached patch. It adds a new flag ImmediateDieOK, which is 
   a
   weaker form of ImmediateInterruptOK that only allows handling a pending
   die-signal in the signal handler.
   
   Robert, others, do you see a problem with this?
  
  Per se I don't have a problem with it. There does exist the problem that
  the user doesn't get an error message in more cases though. On the other
  hand it's bad if any user can prevent the database from restarting.
  
   Over IM, Robert pointed out that it's not safe to jump out of a signal
   handler with siglongjmp, when we're inside library calls, like in a 
   callback
   called by OpenSSL. But even with current master branch, that's exactly 
   what
   we do. In secure_raw_read(), we set ImmediateInterruptOK = true, which 
   means
   that any incoming signal will be handled directly in the signal handler,
   which can mean elog(ERROR). Should we be worried? OpenSSL might get 
   confused
   if control never returns to the SSL_read() or SSL_write() function that
   called secure_raw_read().
  
  But this is imo prohibitive. Yes, we're doing it for a long while. But
  no, that's not ok. It actually prompted me into prototyping the latch
  thing (in some other thread). I don't think existing practice justifies
  expanding it further.
 
 I see; in that case, this approach seems basically
 applicable. But if I understand correctly, this patch seems not
 to return out of the OpenSSL code even when the latch was found to be
 set in secure_raw_write/read.

Correct. That's why I think it's the way forward. There's several
problems now where the inability to do real things while reading/writing
is a problem.

  I tried setting errno = ECONNRESET
 and it went well, but it seems like a bad practice.

Where and why did you do that?

 secure_raw_write(Port *port, const void *ptr, size_t len)
 {
   n = send(port->sock, ptr, len, 0);

   if (!port->noblock && n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
   {
     w = WaitLatchOrSocket(MyProc->procLatch, ...

     if (w & WL_LATCH_SET)
     {
       ResetLatch(MyProc->procLatch);
       /*
        * Force a return, so interrupts can be processed when not
        * (possibly) underneath a ssl library.
        */
       errno = EINTR;
       (return n;  // n is negative)


 my_sock_write(BIO *h, const char *buf, int size)
 {
   res = secure_raw_write(((Port *) h->ptr), buf, size);
   BIO_clear_retry_flags(h);
   if (res <= 0)
   {
     if (errno == EINTR || errno == EWOULDBLOCK || errno == EAGAIN)
     {
       BIO_set_retry_write(h);

Hm, this seems to be, besides one comment, the code from the last patch in my
series. Do you have a particular question about it?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 3

2014-10-02 Thread Andres Freund
On 2014-10-02 11:49:31 +0300, Heikki Linnakangas wrote:
 On 09/23/2014 09:24 PM, Andres Freund wrote:
 I've previously started two threads about replication identifiers. Check
 http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
 and
 http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
 .
 
 They've also been discussed in the course of another thread:
 http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de
 
 And even earlier here:
 http://www.postgresql.org/message-id/flat/1339586927-13156-10-git-send-email-and...@2ndquadrant.com#1339586927-13156-10-git-send-email-and...@2ndquadrant.com
 The thread branched a lot; the relevant branch is the one with the subject
 "[PATCH 10/16] Introduce the concept that wal has a 'origin' node"

Right. Long time ago already ;)

 == Identify the origin of changes ==
 
 Say you're building a replication solution that allows two nodes to
 insert into the same table on two nodes. Ignoring conflict resolution
 and similar fun, one needs to prevent the same change being replayed
 over and over. In logical replication the changes to the heap have to
 be WAL logged, and thus the *replay* of changes from a remote node
 produce WAL which then will be decoded again.
 
 To avoid that it's very useful to tag individual changes/transactions
 with their 'origin'. I.e. mark changes that have been directly
 triggered by the user sending SQL as originating 'locally' and changes
 originating from replaying another node's changes as originating
 somewhere else.
 
 If that origin is exposed to logical decoding output plugins they can
 easily check whether to stream out the changes/transactions or not.
 
 
 It is possible to do this by adding extra columns to every table and
 store the origin of a row in there, but that a) permanently needs
 storage b) makes things much more invasive.
 
 An origin column in the table itself helps tremendously to debug issues with
 the replication system. In many if not most scenarios, I think you'd want to
 have that extra column, even if it's not strictly required.

I don't think you'll have much success convincing actual customers of
that. It's one thing to increase the size of the WAL stream a bit; it's
entirely different to persistently increase the size of all their
tables.

 What I've previously suggested (and which works well in BDR) is to add
 the internal id to the XLogRecord struct. There's 2 free bytes of
 padding that can be used for that purpose.
 
 Adding a field to XLogRecord for this feels wrong. This is for *logical*
 replication - why do you need to mess with something as physical as the WAL
 record format?

XLogRecord isn't all that physical. It doesn't encode anything in that
regard but the fact that there are backup blocks in the record. It's
essentially just an implementation detail of logging. Whether that's
physical or logical doesn't really matter much.

There's basically two primary reasons I think it's a good idea to add it
there:

a) There are many different types of records where it's useful to add the
   origin. Adding the information to all of these will make things more
   complicated, use more space, and be more fragile. And I'm pretty
   sure that the number of things people will want to expose over
   logical replication will increase.

   I know of at least two things that have at least some working code:
   Exposing 2PC to logical decoding to allow optionally synchronous
   replication, and allowing to send transactional/nontransactional
   'messages' via the WAL without writing to a table.

   Now, we could add a framework to attach general information to every
   record - but I have a very hard time seeing how this will be of
   comparable complexity *and* efficiency.

b) It's dead simple with a pretty darn low cost. Both from a runtime as
   well as a maintenance perspective.

c) There needs to be crash recovery interaction anyway to compute the
   state of how far replication succeeded before crashing. So it's not
   like we could make this completely extensible without core code
   knowing.
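
For reference, here is roughly what that "two padding bytes" proposal
amounts to on the 9.4-era record header (an illustrative sketch only, not
committed code; the RepOriginId/xl_origin_id names are assumptions, and
the usual PostgreSQL typedefs such as uint16, TransactionId, RmgrId,
XLogRecPtr and pg_crc32 are taken as given):

typedef uint16 RepOriginId;

#define InvalidRepOriginId	0	/* locally originated changes */

typedef struct XLogRecord
{
	uint32		xl_tot_len;		/* total len of entire record */
	TransactionId xl_xid;		/* xact id */
	uint32		xl_len;			/* total len of rmgr data */
	uint8		xl_info;		/* flag bits */
	RmgrId		xl_rmid;		/* resource manager for this record */
	RepOriginId	xl_origin_id;	/* was: 2 unused bytes of alignment padding */
	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
	pg_crc32	xl_crc;			/* CRC for this record */
} XLogRecord;

An output plugin could then compare xl_origin_id against the local id and
skip records from other origins before doing any decoding work, which is
the kind of pre-decoding filter described above.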

 And who's to say that a node ID is the most useful piece of information for
 a replication system to add to the WAL header. I can easily imagine that
 you'd want to put a changeset ID or something else in there, instead. (I
 mentioned another example of this in
 http://www.postgresql.org/message-id/4fe17043.60...@enterprisedb.com)

I'm on board with adding an extensible facility to attach data to
successful transactions. There've been at least two people asking me
directly about how to e.g. attach user information to transactions.

I don't think that's equivalent to what I'm talking about here
though. One important thing about this proposal is that it allows
completely skipping (nearly, except cache inval) all records with an
uninteresting origin id *before* decoding them. Without having to keep
any per-transaction state about 'uninteresting' 

Re: [HACKERS] Dynamic LWLock tracing via pg_stat_lwlock (proof of concept)

2014-10-02 Thread Andres Freund
On 2014-10-01 18:19:05 +0200, Ilya Kosmodemiansky wrote:
 I have a patch which is actually not commitfest-ready now, but it
 always better to start discussing proof of concept having some patch
 instead of just an idea.

That's a good way to start work on a topic like this.

 From an Oracle DBA's point of view, currently we have a lack of
 performance diagnostics tools.

Not just from an Oracle DBA POV ;). Generally.

So I'm happy to see some focus on this!

 By that, they principally mean an
 Oracle Wait Interface analogue. The basic idea is to have counters or
 sensors all around the database kernel to measure what a particular
 backend is currently waiting for and how long/how often it waits.

Yes, I can see that. I'm not sure whether lwlocks are the primary point
I'd start with though. In many cases you'll wait on so called
'heavyweight' locks too...

 Suppose we have a PostgreSQL instance under heavy write workload, but
 we do not know any details. We could poll the pg_stat_lwlock function
 from time to time, and it would say pid n1 is currently in
 WALWriteLock and pid n2 in WALInsertLock. That means we should think
 about write-ahead log tuning. Or pid n1 is in some clog-related
 LWLock, which means we need to move clog to a ramdisk. This is a simplistic
 example, but it shows how useful LWLock tracing could be for DBAs.
 An even better idea is to collect the daily LWLock distribution, find the
 most frequent locks, etc.

I think it's more complicated than that - but I also think it'd be a
great help for DBAs and us postgres hackers.

 The idea of this patch is to trace LWLocks with the lowest possible
 performance impact. We put an integer lwLockID into the procarray; when
 acquiring an LWLock we store its id there, and we can then poll the
 procarray using a function to see whether a particular pid holds an LWLock.

But a backend can hold more than one lwlock at the same time? I don't
think that's something we can ignore.

 Not
 perfect, but if we sometimes see that somebody consumes a lot of a particular
 LWLock, we could investigate the matter more precisely using
 another tool. Something like that was implemented in the attached
 patch:
 
 issuing pgbench  -c 50 -t 1000 -j 50
 
 we have something like that:
 
 postgres=# select now(),* from pg_stat_lwlock ;
               now               | lwlockid | pid  
 --------------------------------+----------+------
  2014-10-01 15:11:43.848765+02  |       57 | 4257
 (1 row)

Hm. So you just collect the lwlockid and the pid? That doesn't sound
particularly interesting to me. In my opinion, you'd need at least:
* pid
* number of exclusive/shared acquisitions
* number of exclusive/shared acquisitions that had to wait
* total wait time of exclusive/shared acquisitions
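
As a rough illustration of the kind of per-backend bookkeeping that implies
(a sketch only; the struct and field names are assumptions, not part of any
patch, and uint64 is taken as the usual PostgreSQL typedef), roughly 48
bytes of counters per lwlock:

typedef struct LWLockWaitStats
{
	uint64		sh_acquire_count;	/* shared acquisitions */
	uint64		ex_acquire_count;	/* exclusive acquisitions */
	uint64		sh_wait_count;		/* shared acquisitions that had to wait */
	uint64		ex_wait_count;		/* exclusive acquisitions that had to wait */
	uint64		sh_wait_time_us;	/* total shared wait time, in microseconds */
	uint64		ex_wait_time_us;	/* total exclusive wait time, in microseconds */
} LWLockWaitStats;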

 Questions.
 
 1. I've decided to put pg_stat_lwlock into extension pg_stat_lwlock
 (simply for test purposes). Is it OK, or better to implement it
 somewhere inside pg_catalog or in another extension (for example
 pg_stat_statements)?

I personally am doubtful that it makes much sense to move this into an
extension. It'll likely be tightly enough interlinked to backend code
that I don't see the point. But I'd not be surprised if others feel
differently.

I generally don't think you'll get interesting data without a fair bit
of additional work.

The first problem that comes to my mind about collecting enough data is
that we have a very large number of lwlocks (fixed_number + 2 *
shared_buffers). One 'trivial' way of implementing this is to have a per
backend array collecting the information, and then a shared one
accumulating data from it over time. But I'm afraid that's not going to
fly :(. Hm. With the above sets of stats that'd be ~50MB per backend...
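
As a back-of-the-envelope check of that figure, under assumed sizes (a
sketch, not a measurement: shared_buffers = 4GB and roughly 48 bytes of
counters per lwlock, as in the struct above):

#include <stdio.h>

int
main(void)
{
	long long	buffers = 4LL * 1024 * 1024 * 1024 / 8192;	/* 4GB of 8kB pages */
	long long	nlocks = 2 * buffers + 10000;	/* 2 per buffer + assumed fixed locks */
	long long	bytes = nlocks * 48;			/* six 8-byte counters per lwlock */

	printf("%lld lwlocks, ~%.1f MB of stats per backend\n",
		   nlocks, bytes / (1024.0 * 1024.0));
	return 0;
}

That prints about 48.5 MB for roughly a million lwlocks, which is the order
of magnitude of the concern above.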

Perhaps we should somehow encode this differently for individual lwlock
tranches? It's far less problematic to collect all this information for
all but the buffer lwlocks...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Escaping from blocked send() reprised.

2014-10-02 Thread Kyotaro HORIGUCHI
Hi,

   But this is imo prohibitive. Yes, we're doing it for a long while. But
   no, that's not ok. It actually prompted me into prototyping the latch
   thing (in some other thread). I don't think existing practice justifies
   expanding it further.
  
  I see; in that case, this approach seems basically
  applicable. But if I understand correctly, this patch seems not
  to return out of the OpenSSL code even when the latch was found to be
  set in secure_raw_write/read.
 
 Correct. That's why I think it's the way forward. There's several
 problems now where the inability to do real things while reading/writing
 is a problem.
 
   I tried setting errno = ECONNRESET
  and it went well, but it seems like a bad practice.
 
 Where and why did you do that?

The patch in this message:

http://www.postgresql.org/message-id/20140828.214704.93968088.horiguchi.kyot...@lab.ntt.co.jp

The reason for setting errno (instead of a separate variable) is to
trick OpenSSL (or my_sock_write? I forget which) into behaving as if
the underlying send(2) had returned an error that cannot be retried,
i.e. none of the retryable errnos
(EINTR/EWOULDBLOCK/EAGAIN). In my faint memory, merely avoiding
BIO_set_retry_write() in my_sock_write() doesn't work as expected,
but it might be enough that my_sock_write returns -1 and doesn't
set BIO_set_retry_write().

The reason for ECONNRESET is that none of the other errnos possible for
send(2)(*1) seems to fit the real situation, and the
blocked situation resembles a reset connection in the sense that it
cannot continue to work due to an external condition; ECONNRESET
is also used in be_tls_write() in a similar way.

Come to think of it, maybe setting ECONNRESET is not so evil?


  secure_raw_write(Port *port, const void *ptr, size_t len)
  {
    n = send(port->sock, ptr, len, 0);

    if (!port->noblock && n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
    {
      w = WaitLatchOrSocket(MyProc->procLatch, ...

      if (w & WL_LATCH_SET)
      {
        ResetLatch(MyProc->procLatch);
        /*
         * Force a return, so interrupts can be processed when not
         * (possibly) underneath a ssl library.
         */
        errno = EINTR;
        (return n;  // n is negative)


  my_sock_write(BIO *h, const char *buf, int size)
  {
    res = secure_raw_write(((Port *) h->ptr), buf, size);
    BIO_clear_retry_flags(h);
    if (res <= 0)
    {
      if (errno == EINTR || errno == EWOULDBLOCK || errno == EAGAIN)
      {
        BIO_set_retry_write(h);
 
 Hm, this seems to be, besides one comment, the code from the last patch in my
 series. Do you have a particular question about it?

I didn't have a particular question about it. This is cited only
in order to show the route to the retry.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-10-02 Thread Heikki Linnakangas

On 09/15/2014 08:46 PM, Fabien COELHO wrote:



I'm not sure I like the idea of printing a percentage.  It might be
unclear what the denominator was if somebody feels the urge to work
back to the actual number of skipped transactions.  I mean, I guess
it's probably just the value you passed to -R, so maybe that's easy
enough, but then why bother dividing in the first place?  The user can
do that easily enough if they want the data that way.


 Indeed skipped and late per second may have an unclear denominator. If
 you divide by the time, the unit would be tps, but 120 tps performance
 including 20 late tps, plus 10 skipped tps... I do not think that is very
 clear. Reporting tps for transactions *not* performed looks strange.

 Maybe late transactions could be given as a percentage of all processed
 transactions in the interval. But for skipped, a percentage of what? The
 only number that would make sense is the total number of transactions
 scheduled in the interval, but that would mean that the denominator for
 late would be different from the denominator for skipped, which is
 basically incomprehensible.


Hmm. I guess the absolute number makes sense, if you expect that there 
are normally zero skipped transactions, or at least a very small number. 
It's more like a good or no good indicator. Ok, I'm fine with that.


The version I'm now working on prints output like this:


$ ./pgbench -T10 -P1  --rate=1600 --latency-limit=10
starting vacuum...end.
progress: 1.0 s, 1579.0 tps, lat 2.973 ms stddev 2.493, lag 2.414 ms, 4 skipped
progress: 2.0 s, 1570.0 tps, lat 2.140 ms stddev 1.783, lag 1.599 ms, 0 skipped
progress: 3.0 s, 1663.0 tps, lat 2.372 ms stddev 1.742, lag 1.843 ms, 4 skipped
progress: 4.0 s, 1603.2 tps, lat 2.435 ms stddev 2.247, lag 1.902 ms, 13 skipped
progress: 5.0 s, 1540.9 tps, lat 1.845 ms stddev 1.270, lag 1.303 ms, 0 skipped
progress: 6.0 s, 1588.0 tps, lat 1.630 ms stddev 1.003, lag 1.097 ms, 0 skipped
progress: 7.0 s, 1577.0 tps, lat 2.071 ms stddev 1.445, lag 1.517 ms, 0 skipped
progress: 8.0 s, 1669.9 tps, lat 2.375 ms stddev 1.917, lag 1.846 ms, 0 skipped
progress: 9.0 s, 1636.0 tps, lat 2.801 ms stddev 2.354, lag 2.250 ms, 5 skipped
progress: 10.0 s, 1606.1 tps, lat 2.751 ms stddev 2.117, lag 2.197 ms, 2 skipped
transaction type: TPC-B (sort of)
scaling factor: 5
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 16034
number of transactions skipped: 28 (0.174 %)
number of transactions above the 10.0 ms latency limit: 70 (0.436 %)
latency average: 2.343 ms
latency stddev: 1.940 ms
rate limit schedule lag: avg 1.801 (max 9.994) ms
tps = 1603.370819 (including connections establishing)
tps = 1603.619536 (excluding connections establishing)


Those progress lines are 79 or 80 characters wide, so they *just* fit in 
an 80-char terminal. Of course, if any of the printed numbers were 
higher, it would not fit. I don't see how to usefully make it more 
terse, though. I think we can live with this - these days it shouldn't 
be a huge problem to enlarge the terminal to make the output fit.


Here are new patches, again the first one is just refactoring, and the 
second one contains this feature. I'm planning to commit the first one 
shortly, and the second one later after people have had a chance to look 
at it.


Greg: As the author of pgbench-tools, what do you think of this patch? 
The log file format, in particular.


- Heikki

From 512fde5dc3fde5fc1368b3bf0c09e3ea8e022fad Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas heikki.linnakan...@iki.fi
Date: Thu, 2 Oct 2014 12:58:14 +0300
Subject: [PATCH 1/2] Refactor pgbench log-writing code to a separate function.

The doCustom function was incredibly long; this makes it a little bit more
readable.
---
 contrib/pgbench/pgbench.c | 340 +++---
 1 file changed, 169 insertions(+), 171 deletions(-)

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 087e0d3..c14a577 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -347,6 +347,9 @@ static char *select_only = {
 static void setalarm(int seconds);
 static void *threadRun(void *arg);
 
+static void doLog(TState *thread, CState *st, FILE *logfile, instr_time *now,
+	  AggVals *agg);
+
 static void
 usage(void)
 {
@@ -1016,6 +1019,16 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVa
 	PGresult   *res;
 	Command   **commands;
 	bool		trans_needs_throttle = false;
+	instr_time	now;
+
+	/*
+	 * gettimeofday() isn't free, so we get the current timestamp lazily the
+	 * first time it's needed, and reuse the same value throughout this
+	 * function after that. This also ensures that e.g. the calculated latency
+	 * reported in the log file and in the totals are the same. Zero means
+	 * not set yet.
+	 */
+	INSTR_TIME_SET_ZERO(now);
 
 top:
 	commands = sql_files[st->use_file];
@@ -1049,10 +1062,10 @@ top:
 
 	if (st->sleeping)
 	{

Re: [HACKERS] Dynamic LWLock tracing via pg_stat_lwlock (proof of concept)

2014-10-02 Thread Ilya Kosmodemiansky
On Thu, Oct 2, 2014 at 11:50 AM, Andres Freund and...@2ndquadrant.com wrote:
 Not just from a oracle DBA POV ;). Generally.

sure

 By that, they principally mean an
 Oracle Wait Interface analogue. The basic idea is to have counters or
 sensors all around the database kernel to measure what a particular
 backend is currently waiting for and how long/how often it waits.

 Yes, I can see that. I'm not sure whether lwlocks are the primary point
 I'd start with though. In many cases you'll wait on so called
 'heavyweight' locks too...


I'm trying to kill two birds with one stone: do some preparatory work
on the main large topic and deliver some convenience for LWLock
diagnostics. Maybe I'm wrong, but it seems to me it is a much easier
task to advocate for a more desired feature: we have some diagnostics
tools for heavyweight locks, and they are better than what we have for lwlocks.



 Suppose we have a PostgreSQL instance under heavy write workload, but
 we do not know any details. We could poll the pg_stat_lwlock function
 from time to time, and it would say pid n1 is currently in
 WALWriteLock and pid n2 in WALInsertLock. That means we should think
 about write-ahead log tuning. Or pid n1 is in some clog-related
 LWLock, which means we need to move clog to a ramdisk. This is a simplistic
 example, but it shows how useful LWLock tracing could be for DBAs.
 An even better idea is to collect the daily LWLock distribution, find the
 most frequent locks, etc.

 I think it's more complicated than that - but I also think it'd be a
 great help for DBAs and us postgres hackers.


Sure, it is more complicated; the example is simplistic, just to show the point.


 The idea of this patch is to trace LWLocks with the lowest possible
 performance impact. We put an integer lwLockID into the procarray; when
 acquiring an LWLock we store its id there, and we can then poll the
 procarray using a function to see whether a particular pid holds an LWLock.

 But a backend can hold more than one lwlock at the same time? I don't
 think that's something we can ignore.


Yes, this is one of the next steps. I have not figured out yet how to do
it less painfully than LWLOCK_STATS does.


 I personally am doubtful that it makes much sense to move this into an
 extension. It'll likely be tightly enough interlinked to backend code
 that I don't see the point. But I'd not be surprised if others feel
 differently.


That's why I asked this question, and also because I have no idea where
exactly to put these functions inside the backend if not into an extension.
But there are probably more important tasks in this work than
moving the function into core, so I could do that later if it turns out to be
necessary.


 I generally don't think you'll get interesting data without a fair bit
 of additional work.

Sure

 The first problem that comes to my mind about collecting enough data is
 that we have a very large number of lwlocks (fixed_number + 2 *
 shared_buffers). One 'trivial' way of implementing this is to have a per
 backend array collecting the information, and then a shared one
 accumulating data from it over time. But I'm afraid that's not going to
 fly :(. Hm. With the above sets of stats that'd be ~50MB per backend...

 Perhaps we should somehow encode this differently for individual lwlock
 tranches? It's far less problematic to collect all this information for
 all but the buffer lwlocks...

That is a good point. There are actually two things to keep in mind:
i) user interface, ii) implementation

i) Personally, as a DBA, I do not see much sense in an unaggregated list
of pid, lwlockid, wait_time or something like that.

Much better to have aggregation by pid and lwlockid, for instance:
- pid
- lwlockid
- lwlockname
- total_count (or the number of exclusive/shared acquisitions that had to
wait, as you suggest; since we have a lot of lwlocks I am doubtful
about how important the information about non-waiting lwlocks is)

ii) Am I correct that you suggest going through MainLWLockTranche and
retrieving all available lwlock information into some structure like the
lwLockCell structure I've used in my patch? Something like a hash of
lwlockid -> usagecount?


Regards,
Ilya


 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training  Services



-- 
Ilya Kosmodemiansky,

PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
i...@postgresql-consulting.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dynamic LWLock tracing via pg_stat_lwlock (proof of concept)

2014-10-02 Thread Ilya Kosmodemiansky
On Thu, Oct 2, 2014 at 5:25 AM, Craig Ringer cr...@2ndquadrant.com wrote:
 It's not at all clear to me that a DTrace-like (or perf-based, rather)
 approach is unsafe, slow, or unsuitable for production use.

 With appropriate wrapper tools I think we could have quite a useful
 library of perf-based diagnostics and tracing tools for PostgreSQL.

It is not actually very slow; the overhead is quite reasonable given that
we want such comprehensive performance diagnostics. Regarding stability, I
have had a couple of issues where Postgres crashed with dtrace and did
not crash without it. Most of them were on FreeBSD, which is still in use
by many people, and were actually caused by FreeBSD's dtrace, but for me
that is quite enough to have doubts about keeping a dtrace-aware build in
production.

OK, OK - maybe things have changed in the last couple of years or will
change soon. Still, dtrace/perf is good enough for those who are
familiar with it, but you need a really convenient wrapper to make an
Oracle/DB2 DBA happy with such an approach.


 Resolving lock IDs to names might be an issue, though.

I am afraid it is


 --
  Craig Ringer   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training  Services



-- 
Ilya Kosmodemiansky,

PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
i...@postgresql-consulting.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] proposal: doc: simplify examples of dynamic SQL

2014-10-02 Thread Pavel Stehule
Hi

There are a few less readable examples of dynamic SQL in the plpgsql docs,

like:

EXECUTE 'SELECT count(*) FROM '
|| tabname::regclass
|| ' WHERE inserted_by = $1 AND inserted = $2'
   INTO c
   USING checked_user, checked_date;

or

EXECUTE 'UPDATE tbl SET '
|| quote_ident(colname)
|| ' = $'
|| newvalue
|| '$ WHERE key = '
|| quote_literal(keyvalue);

We can show examples based on the format function only:

EXECUTE format('SELECT count(*) FROM %I'
   ' WHERE inserted_by = $1 AND inserted = $2',
tabname)
   INTO c
   USING checked_user, checked_date;

or

EXECUTE format('UPDATE tbl SET %I = %L WHERE key = %L',
               colname, newvalue, keyvalue);
or

EXECUTE format('UPDATE tbl SET %I = %L WHERE key = $1',
               colname, newvalue)
  USING keyvalue;

The old examples are very instructive, but a little bit less readable and maybe
too complex for beginners.

Opinions?

Regards

Pavel


Re: [HACKERS] pgcrypto: PGP signatures

2014-10-02 Thread Marko Tiikkaja

On 10/2/14 1:47 PM, Heikki Linnakangas wrote:

I looked at this briefly, and was surprised that there is no support for
signing a message without encrypting it. Is that intentional? Instead of
adding a function to encrypt and sign a message, I would have expected
this to just add a new function for signing, and you could then pass it
an already-encrypted blob, or plaintext.


Yes, that's intentional.  The signatures are part of the encrypted data 
here, so you can't look at a message and determine who sent it.


There was brief discussion about this upthread (though no one probably 
added any links to those discussions into the commit fest app), and I 
still think that both types of signing would probably be valuable.  But 
this patch is already quite big, and I really have no desire to work on 
this "sign anything" functionality.  The pieces are there, though, so if 
someone wants to do it, I don't see why they couldn't.



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dynamic LWLock tracing via pg_stat_lwlock (proof of concept)

2014-10-02 Thread Stephen Frost
* Craig Ringer (cr...@2ndquadrant.com) wrote:
  The patch https://commitfest.postgresql.org/action/patch_view?id=885
  (discussion starts here I hope -
  http://www.postgresql.org/message-id/4fe8ca2c.3030...@uptime.jp)
  demonstrates performance problems; LWLOCK_STAT, LOCK_DEBUG and
  DTrace-like approaches are slow, unsafe for production use and a bit
  clumsy for a DBA to use.
 
 It's not at all clear to me that a DTrace-like (or perf-based, rather)
 approach is unsafe, slow, or unsuitable for production use.

I've certainly had it take production systems down (perf, specifically),
so I'd definitely consider it unsafe.  I wouldn't say it's unusable,
but it's certainly not what we should have as the end-goal for PG.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Dynamic LWLock tracing via pg_stat_lwlock (proof of concept)

2014-10-02 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
  1. I've decided to put pg_stat_lwlock into extension pg_stat_lwlock
  (simply for test purposes). Is it OK, or better to implement it
  somewhere inside pg_catalog or in another extension (for example
  pg_stat_statements)?
 
 I personally am doubtful that it makes much sense to move this into an
 extension. It'll likely be tightly enough interlinked to backend code
 that I don't see the point. But I'd not be surprised if others feel
 differently.

I agree that this doesn't make sense as an extension.

 I generally don't think you'll get interesting data without a fair bit
 of additional work.

I'm not sure about this..

 The first problem that comes to my mind about collecting enough data is
 that we have a very large number of lwlocks (fixed_number + 2 *
 shared_buffers). One 'trivial' way of implementing this is to have a per
 backend array collecting the information, and then a shared one
 accumulating data from it over time. But I'm afraid that's not going to
 fly :(. Hm. With the above sets of stats that'd be ~50MB per backend...

I was just going to suggest exactly this- a per-backend array which then
gets pushed into a shared area periodically.  Taking up 50MB per backend
is quite a bit though. :/

 Perhaps we should somehow encode this differently for individual lwlock
 tranches? It's far less problematic to collect all this information for
 all but the buffer lwlocks...

Yeah, that seems like it would at least be a good approach to begin
with.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Replication identifiers, take 3

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 4:49 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 An origin column in the table itself helps tremendously to debug issues with
 the replication system. In many if not most scenarios, I think you'd want to
 have that extra column, even if it's not strictly required.

I like a lot of what you wrote here, but I strongly disagree with this
part.  A good replication solution shouldn't require changes to the
objects being replicated.  The triggers that Slony and other current
logical replication solutions use are a problem not only because
they're slow (although that is a problem) but also because they
represent a user-visible wart that people who don't otherwise care
about the fact that their database is being replicated have to be
concerned with.  I would agree that some people might, for particular
use cases, want to include origin information in the table that the
replication system knows about, but it shouldn't be required.

When you look at the replication systems that we have today, you've
basically got streaming replication, which is high-performance and
fairly hands-off (at least once you get it set up properly; that part
can be kind of a bear) but can't cross versions let alone database
systems and requires that the slaves be strictly read-only.  Then on
the flip side you've got things like Slony, Bucardo, and others.  Some
of these allow multi-master; all of them at least allow table-level
determination of which server has the writable copy.  Nearly all of
them are cross-version and some even allow replication into
non-PostgreSQL systems.  But they are slow *and administratively
complex*.  If we're able to get something that feels like streaming
replication from a performance and administrative complexity point of
view but can be cross-version and allow at least some writes on
slaves, that's going to be an epic leap forward for the project.

In my mind, that means it's got to be completely hands-off from a
schema design point of view: you should be able to start up a database
and design it however you want, put anything you like into it, and
then decide later that you want to bolt logical replication on top of
it, just as you can for streaming physical replication.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-10-02 Thread Robert Haas
On Wed, Oct 1, 2014 at 4:56 PM, Stephen Frost sfr...@snowman.net wrote:
 * Robert Haas (robertmh...@gmail.com) wrote:
 On Mon, Sep 29, 2014 at 12:05 PM, Stephen Frost sfr...@snowman.net wrote:
  Perhaps I'm just being a bit over the top, but all this per-character
  work feels a bit ridiculous..  When we're using MAXIMUM_ALIGNOF, I
  suppose it's not so bad, but is there no hope to increase that and make
  this whole process more efficient?  Just a thought.

 I'm not sure I understand what you're getting at here.

 Was just thinking that we might be able to work out what needs to be
 done without having to actually do it on a per-character basis.  That
 said, I'm not sure it's really worth the effort given that we're talking
 about at most 8 bytes currently.  I had originally been thinking that we
 might increase the minimum size as it might make things more efficient,
 but it's not clear that'd really end up being the case either and,
 regardless, it's probably not worth worrying about at this point.

I'm still not entirely sure we're on the same page.  Most of the data
movement for shm_mq is done via memcpy(), which I think is about as
efficient as it gets.  The detailed character-by-character handling
only really comes up when shm_mq_send() is passed multiple chunks with
lengths that are not a multiple of MAXIMUM_ALIGNOF.  Then we have to
fiddle a bit more.  So making MAXIMUM_ALIGNOF bigger would actually
cause us to do more fiddling, not less.

When I originally designed this queue, I had the idea in mind that
life would be simpler if the queue head and tail pointers always
advanced in multiples of MAXIMUM_ALIGNOF.  That didn't really work out
as well as I'd hoped; maybe I would have been better off having the
queue pack everything in tightly and ignore alignment.  However, there
is some possible efficiency advantage of the present system: when a
message fits in the queue without wrapping, shm_mq_receive() returns a
pointer to the message, and the caller can assume that message is
properly aligned.  If the queue didn't maintain alignment internally,
you'd need to do a copy there before accessing anything inside the
message that might be aligned.  Granted, we don't have any code that
takes advantage of that yet, at least not in core, but the potential
is there.  Still, I may have made the wrong call.  But, I don't think
it's the of this patch to rework the whole framework; we can do that
in the future after this has a few more users and the pros and cons of
various approaches are (hopefully) more clear.  It's not a complex
API, so substituting a different implementation later on probably
wouldn't be too hard.
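
To illustrate the alignment property being described (a toy sketch only,
not the real shm_mq code; the ring layout and names here are assumptions):

#include <stdint.h>
#include <string.h>

#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
	(((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

typedef struct toy_queue
{
	size_t		write_off;		/* always kept a multiple of MAXIMUM_ALIGNOF */
	char		buf[8192];		/* ring storage, itself maxaligned */
} toy_queue;

/*
 * Append one length-prefixed message, advancing write_off by maxaligned
 * amounts only.  Because of that, a reader that finds a message which does
 * not wrap around the end of the buffer can be handed a pointer straight
 * into buf[], and that pointer is already suitably aligned.  (Wraparound
 * and flow control, which the real queue handles, are ignored here.)
 */
static void
toy_send(toy_queue *q, const void *data, size_t len)
{
	memcpy(q->buf + q->write_off, &len, sizeof(len));
	memcpy(q->buf + q->write_off + MAXALIGN(sizeof(len)), data, len);
	q->write_off += MAXALIGN(sizeof(len)) + MAXALIGN(len);
}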

Anyway, it sounds like you're on board with me committing the first
patch of the series, so I'll do that next week absent objections.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-10-02 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
 On Wed, Oct 1, 2014 at 4:56 PM, Stephen Frost sfr...@snowman.net wrote:
  Was just thinking that we might be able to work out what needs to be
  done without having to actually do it on a per-character basis.  That
  said, I'm not sure it's really worth the effort given that we're talking
  about at most 8 bytes currently.  I had originally been thinking that we
  might increase the minimum size as it might make things more efficient,
  but it's not clear that'd really end up being the case either and,
  regardless, it's probably not worth worrying about at this point.
 
 I'm still not entirely sure we're on the same page.  Most of the data
 movement for shm_mq is done via memcpy(), which I think is about as
 efficient as it gets.

Right- agreed.  I had originally thought we were doing things on a
per-MAXIMUM_ALIGNOF-basis somewhere else, but that appears to be an
incorrect assumption (which I'm glad for).

 The detailed character-by-character handling
 only really comes up when shm_mq_send() is passed multiple chunks with
 lengths that are not a multiple of MAXIMUM_ALIGNOF.  Then we have to
 fiddle a bit more.  So making MAXIMUM_ALIGNOF bigger would actually
 cause us to do more fiddling, not less.

Sorry- those were two independent items.  Regarding the per-character
work, I was thinking we could work out the number of characters which
need to be moved and then use memcpy directly rather than doing the
per-character work, but as noted, we shouldn't be going through that
loop very often or for very many iterations anyway, and we have to deal
with moving forward through the iovs, so we'd still have to have a loop
there regardless.

 When I originally designed this queue, I had the idea in mind that
 life would be simpler if the queue head and tail pointers always
 advanced in multiples of MAXIMUM_ALIGNOF.  That didn't really work out
 as well as I'd hoped; maybe I would have been better off having the
 queue pack everything in tightly and ignore alignment.  However, there
 is some possible efficiency advantage of the present system: when a
 message fits in the queue without wrapping, shm_mq_receive() returns a
 pointer to the message, and the caller can assume that message is
 properly aligned.  If the queue didn't maintain alignment internally,
 you'd need to do a copy there before accessing anything inside the
 message that might be aligned.  Granted, we don't have any code that
 takes advantage of that yet, at least not in core, but the potential
 is there.  Still, I may have made the wrong call.  But, I don't think
  it's the job of this patch to rework the whole framework; we can do that
 in the future after this has a few more users and the pros and cons of
 various approaches are (hopefully) more clear.  It's not a complex
 API, so substituting a different implementation later on probably
 wouldn't be too hard.

Agreed.

 Anyway, it sounds like you're on board with me committing the first
 patch of the series, so I'll do that next week absent objections.

Works for me.

Thanks!

Stephen


signature.asc
Description: Digital signature


[HACKERS] Log notice that checkpoint is to be written on shutdown

2014-10-02 Thread Michael Banck
Hi,

we have seen repeatedly that users can be confused about why PostgreSQL
is not shutting down even though they requested it.  Usually, this is
because `log_checkpoints' is not enabled and the final checkpoint is
being written, delaying shutdown. As no message besides "shutting down"
is written to the server log in this case, we even had users believing
the server was hanging and pondering killing it manually.

In order to alert those users that a checkpoint is being written, I
propose to add a log message "waiting for checkpoint ..." on shutdown,
even if log_checkpoints is disabled, as this particular checkpoint might
be important information.

I've attached a trivial patch for this, should it be added to the next
commitfest?


Cheers,

Michael

-- 
Michael Banck
Projektleiter / Berater
Tel.: +49 (2161) 4643-171
Fax:  +49 (2161) 4643-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Hohenzollernstr. 133, 41061 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5a4dbb9..78483ca 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8085,10 +8085,14 @@ CreateCheckPoint(int flags)
 
 	/*
 	 * If enabled, log checkpoint start.  We postpone this until now so as not
-	 * to log anything if we decided to skip the checkpoint.
+	 * to log anything if we decided to skip the checkpoint.  If we are shutting
+	 * down and checkpoints are not being logged, add a log message that a
+	 * checkpoint is to be written and shutdown is potentially delayed.
 	 */
 	if (log_checkpoints)
 		LogCheckpointStart(flags, false);
+	else if (flags & CHECKPOINT_IS_SHUTDOWN)
+		ereport(LOG, (errmsg("waiting for checkpoint ...")));
 
 	TRACE_POSTGRESQL_CHECKPOINT_START(flags);
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time measurement format - more human readable

2014-10-02 Thread Bogdan Pilch
 On 9/29/14, 1:08 AM, Andres Freund wrote:
 On 2014-09-28 20:32:30 -0400, Gregory Smith wrote:
 There are already a wide range of human readable time interval output
 formats available in the database; see the list at 
 http://www.postgresql.org/docs/current/static/datatype-datetime.html#INTERVAL-STYLE-OUTPUT-TABLE
 He's talking about psql's \timing...
 
 I got that.  My point was that even though psql's timing report is
 kind of a quick thing hacked into there, if it were revised I'd
 expect two things will happen eventually:
 
 -Asking if any of the interval conversion code can be re-used
 for this purpose, rather than adding yet another custom-to-one-code-path
 standard.
 
 -Asking if this should really just be treated like a full interval
 instead, and then overlapping with a significant amount of that
 baggage so that you have all the existing format choices.

That's actually a good idea.
So what you're saying is that if I come up with some nice way of
setting customized time output format, keeping the default the way it
is now, then it would be worth considering?

Now I understand why it says that a discussion is recommended before
implementing and posting. ;-)

bogdan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 9:54 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Alvaro Herrera wrote:
 So in essence what we're going to do is that the balance mechanism
 considers only tables that don't have per-table configuration options;
 for those that do, we will use the values configured there without any
 changes.

 I'll see about implementing this and making sure it finds its way to
 9.4beta3.

 Here's a patch that makes it work as proposed.

 How do people feel about back-patching this?  On one hand it seems
 there's a lot of fear of changing autovacuum behavior in back branches,
 because for many production systems it has carefully been tuned; on the
 other hand, it seems hard to believe that anyone has tuned the system to
 work sanely given how insanely per-table options behave in the current
 code.

I agree with both of those arguments.  I have run into very few
customers who have used the autovacuum settings to customize behavior
for particular tables, and anyone who hasn't should see no change
(right?), so my guess is that the practical impact of the change will
be pretty limited.  On the other hand, it's a clear behavior change.
Someone could have set the per-table limit to something enormous and
never suffered from that setting because it has basically no effect as
things stand right now today; and that person might get an unpleasant
surprise when they update.

I would at least back-patch it to 9.4.  I could go either way on
whether to back-patch into older branches.  I lean mildly in favor of
it at the moment, but with considerable trepidation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] port/atomics/arch-*.h are missing from installation

2014-10-02 Thread Kohei KaiGai
I got the following error when I try to build my extension
against the latest master branch.

Are the port/atomics/*.h files forgotten on make install?

[kaigai@magro pg_strom]$ make
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -g -fpic -Wall -O0
-DPGSTROM_DEBUG=1 -I. -I./ -I/usr/local/pgsql/include/server
-I/usr/local/pgsql/include/internal -D_GNU_SOURCE   -c -o shmem.o
shmem.c
In file included from /usr/local/pgsql/include/server/storage/barrier.h:21:0,
 from shmem.c:18:
/usr/local/pgsql/include/server/port/atomics.h:65:36: fatal error:
port/atomics/arch-x86.h: No such file or directory
 # include "port/atomics/arch-x86.h"
^
compilation terminated.
make: *** [shmem.o] Error 1


Even though the source directory has header files...

[kaigai@magro sepgsql]$ find ./src | grep atomics
./src/include/port/atomics
./src/include/port/atomics/generic-xlc.h
./src/include/port/atomics/arch-x86.h
./src/include/port/atomics/generic-acc.h
./src/include/port/atomics/arch-ppc.h
./src/include/port/atomics/generic.h
./src/include/port/atomics/arch-hppa.h
./src/include/port/atomics/generic-msvc.h
./src/include/port/atomics/arch-ia64.h
./src/include/port/atomics/generic-sunpro.h
./src/include/port/atomics/arch-arm.h
./src/include/port/atomics/generic-gcc.h
./src/include/port/atomics/fallback.h
./src/include/port/atomics.h
./src/backend/port/atomics.c

the install destination has only atomics.h

[kaigai@magro sepgsql]$ find /usr/local/pgsql/include | grep atomics
/usr/local/pgsql/include/server/port/atomics.h

The attached patch is probably the right remedy.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


pgsql-v9.5-fixup-makefile-for-atomics.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-10-02 Thread Andres Freund
On 2014-09-26 10:28:21 -0400, Robert Haas wrote:
 On Fri, Sep 26, 2014 at 8:55 AM, Oskari Saarenmaa o...@ohmu.fi wrote:
  So you think a read barrier is the same thing as an acquire barrier
  and a write barrier is the same as a release barrier?  That would be
  surprising.  It's certainly not true in general.
 
  The above doc describes the difference: read barrier requires loads before
  the barrier to be completed before loads after the barrier - an acquire
  barrier is the same, but it also requires loads to be complete before stores
  after the barrier.
 
  Similarly write barrier requires stores before the barrier to be completed
  before stores after the barrier - a release barrier is the same, but it also
  requires loads before the barrier to be completed before stores after the
  barrier.
 
  So acquire is read + loads-before-stores and release is write +
  loads-before-stores.
 
 Hmm.  My impression was that an acquire barrier means that loads and
 stores can migrate forward across the barrier but not backward; and
 that a release barrier means that loads and stores can migrate
 backward across the barrier but not forward.

It's actually more complex than that :(

Simple things first:

Oracle's definition seems pretty iron clad:
http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
__machine_acq_barrier is a clear superset of __machine_r_barrier and
__machine_rel_barrier is a clear superset of __machine_w_barrier

And that's what we're essentially discussing, no? That said, there seems
to be no reason to avoid using __machine_r/w_barrier().


But for the reason why I defined pg_read_barrier/write_barrier to
__atomic_thread_fence(__ATOMIC_ACQUIRE/RELEASE):

The C11/C++11 definition it's made for is hellishly hard to
understand. There are very subtle differences between acquire/release
operations and acquire/release fences. 29.8.2/7.17.4 seem to be the relevant
parts of the standards. I think it essentially guarantees the mapping
we're talking about, but it's not entirely clear.

The way acquire/release fences are defined is that they form a
'synchronizes-with' relationship with each other. Which would, I think,
be sufficient given that without a release like operation on the other
thread a read/wrie barrier isn't worth much. But there's a rub in that
it requires a atomic operation involved somehere to give that guarantee.

I *did* check that the emitted code on relevant architectures is sane,
but that doesn't guarantee anything for the future.

Therefore I'm proposing to replace it with __ATOMIC_ACQ_REL which is
definitely guaranteeing what we need, even if superfluously heavy on some
platforms. It still is significantly more efficient than
__sync_synchronize() which is what was used before. I.e. it generates no
code on x86 (MFENCE otherwise), and only a lwsync on PPC (hwsync
otherwise, although I don't know why) and similar on ia64.

As a reference, relevant standard sections are:
C11: 5.1.2.4 5); 7.17.4
C++11: 29.3; 1.10
Not that we can rely on those, but I think it's a good thing to orient on.

 I'm actually not really sure what this means unless the barrier also
 does something in and of itself.

 For example, consider this:
 
 some stuff
 CAS(lock, 0, 1) // i am an acquire barrier
 more stuff
 lock = 0 // i am a release barrier
 even more stuff
 
 If the CAS() and lock = 0 instructions were FULL barriers, then we'd
 be saying that the stuff that happens in the critical section needs to
 be exactly more stuff.  But if they are acquire and release
 barriers, respectively, then the CPU is allowed to move some stuff
 or even more stuff into the critical section; but what it can't do
 is move more stuff out.

 Now if you just have a naked acquire barrier that is not doing
 anything itself, I don't really know what the semantics of that should
 be.

Which is why these acquire/release fences, in contrast to
acquire/release operations, have more guarantees... You put your finger
right onto the spot.

 Say I want to appear to only change things while flag is 1, so I
 write this code:
 
 flag = 1
 acquire barrier
 things++
 release barrier
 flag = 0
 
 With the definition you (and Oracle) propose

As written above, I don't think that applies to oracle's definition?

 this won't work, because
 there's nothing to keep the modification of things from being
 reordered before flag = 1.  What good is that?  Apparently, I don't
 have any idea!

I hope it's a bit clearer now?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Andres Freund
On 2014-09-25 10:42:29 -0400, Robert Haas wrote:
 On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
  On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
   That leads me to wonder: Have you measured different, lower, number of
   buffer mapping locks? 128 locks is, if we'd as we should align them
   properly, 8KB of memory. Common L1 cache sizes are around 32k...
 
  Amit has some results upthread showing 64 being good, but not as good
  as 128.  I haven't verified that myself, but have no reason to doubt
  it.
 
  How about you push the spinlock change and I crosscheck the partition
  number on a multi socket x86 machine? Seems worthwile to make sure that
  it doesn't cause problems on x86. I seriously doubt it'll, but ...
 
 OK.

Given that the results look good, do you plan to push this?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com wrote:
 OK.

 Given that the results look good, do you plan to push this?

By this, you mean the increase in the number of buffer mapping
partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

If so, and if you don't have any reservations, yeah I'll go change it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Andres Freund
On 2014-10-02 10:40:30 -0400, Robert Haas wrote:
 On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com wrote:
  OK.
 
  Given that the results look good, do you plan to push this?
 
 By this, you mean the increase in the number of buffer mapping
 partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

Yes. Now that I think about it I wonder if we shouldn't define 
MAX_SIMUL_LWLOCKS like
#define MAX_SIMUL_LWLOCKS   (NUM_BUFFER_PARTITIONS + 64)
or something like that?

 If so, and if you don't have any reservations, yeah I'll go change it.

Yes, I'm happy with going forward.


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 10:34 AM, Andres Freund and...@2ndquadrant.com wrote:
 It's actually more complex than that :(

 Simple things first:

 Oracle's definition seems pretty iron clad:
 http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
 __machine_acq_barrier is a clear superset of __machine_r_barrier and
 __machine_rel_barrier is a clear superset of __machine_w_barrier

 And that's what we're essentially discussing, no? That said, there seems
 to be no reason to avoid using __machine_r/w_barrier().

So let's use those, then.

 But for the reason why I defined pg_read_barrier/write_barrier to
 __atomic_thread_fence(__ATOMIC_ACQUIRE/RELEASE):

 The C11/C++11 definition it's made for is hellishly hard to
 understand. There's very subtle differences between acquire/release
 operation and acquire/release fences. 29.8.2/7.17.4 seems to be the relevant
 parts of the standards. I think it essentially guarantees the mapping
 we're talking about, but it's not entirely clear.

 The way acquire/release fences are defined is that they form a
 'synchronizes-with' relationship with each other. Which would, I think,
 be sufficient given that without a release like operation on the other
 thread a read/wrie barrier isn't worth much. But there's a rub in that
 it requires a atomic operation involved somehere to give that guarantee.

 I *did* check that the emitted code on relevant architectures is sane,
 but that doesn't guarantee anything for the future.

 Therefore I'm proposing to replace it with __ATOMIC_ACQ_REL which is
 definitely guaranteeing what we need, even if superflously heavy on some
 platforms. It still is significantly more efficient than
 __sync_synchronize() which is what was used before. I.e. it generates no
 code on x86 (MFENCE otherwise), and only a lwsync on PPC (hwsync
 otherwise, although I don't know why) and similar on ia64.

A full barrier on x86 should be an mfence, right?  With only a
compiler barrier, you have loads ordered with respect to loads and
stores ordered with respect to stores, but the load/store ordering
isn't fully defined.
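
For reference, the distinction in gcc/x86-64 terms, sketched with made-up
macro names (a compiler barrier only constrains the compiler; a full barrier
needs an actual fencing instruction):

#define my_compiler_barrier()	__asm__ __volatile__ ("" ::: "memory")
#define my_full_barrier()		__asm__ __volatile__ ("mfence" ::: "memory")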

 Which is why these acquire/release fences, in contrast to
 acquire/release operations, have more guarantees... You put your finger
 right onto the spot.

But, uh, we still don't seem to know what those guarantees actually ARE.

 Say I want to appear to only change things while flag is 1, so I
 write this code:

 flag = 1
 acquire barrier
 things++
 release barrier
 flag = 0

 With the definition you (and Oracle) propose
 this won't work, because
 there's nothing to keep the modification of things from being
 reordered before flag = 1.  What good is that?  Apparently, I don't
 have any idea!

 As written above, I don't think that applies to oracle's definition?

Oracle's definition doesn't look sufficient there.  The acquire
barrier guarantees that the load operations before the barrier will be
completed before the load and store operations after the barrier, but
the only operation before the barrier is a store, not a load, so it
guarantees nothing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-10-02 10:40:30 -0400, Robert Haas wrote:
 On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  OK.
 
  Given that the results look good, do you plan to push this?

 By this, you mean the increase in the number of buffer mapping
 partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

 Yes. Now that I think about it I wonder if we shouldn't define 
 MAX_SIMUL_LWLOCKS like
 #define MAX_SIMUL_LWLOCKS   (NUM_BUFFER_PARTITIONS + 64)
 or something like that?

Nah.  That assumes NUM_BUFFER_PARTITIONS will always be the biggest
thing, and I don't see any reason to assume that, even if we're making
it true for now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] port/atomics/arch-*.h are missing from installation

2014-10-02 Thread Andres Freund
Hi,

On 2014-10-02 23:33:36 +0900, Kohei KaiGai wrote:
 I got the following error when I try to build my extension
 towards the latest master branch.
 
 Is the port/atomics/*.h files forgotten on make install?

You're right.

 The attached patch is probably right remedy.

I've changed the order to be alphabetic, but otherwise it looks
good. Pushed.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-10-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Thu, Oct 2, 2014 at 9:54 AM, Alvaro Herrera alvhe...@2ndquadrant.com 
 wrote:
  Alvaro Herrera wrote:
  So in essence what we're going to do is that the balance mechanism
  considers only tables that don't have per-table configuration options;
  for those that do, we will use the values configured there without any
  changes.
 
  I'll see about implementing this and making sure it finds its way to
  9.4beta3.
 
  Here's a patch that makes it work as proposed.
 
  How do people feel about back-patching this?  On one hand it seems
  there's a lot of fear of changing autovacuum behavior in back branches,
  because for many production systems it has carefully been tuned; on the
  other hand, it seems hard to believe that anyone has tuned the system to
  work sanely given how insanely per-table options behave in the current
  code.
 
 I agree with both of those arguments.  I have run into very few
 customers who have used the autovacuum settings to customize behavior
 for particular tables, and anyone who hasn't should see no change
 (right?), so my guess is that the practical impact of the change will
 be pretty limited.  On the other hand, it's a clear behavior change.
 Someone could have set the per-table limit to something enormous and
 never suffered from that setting because it has basically no effect as
 things stand right now today; and that person might get an unpleasant
 surprise when they update.
 
 I would at least back-patch it to 9.4.  I could go either way on
 whether to back-patch into older branches.  I lean mildly in favor of
 it at the moment, but with considerable trepidation.

I'm fine with putting it into 9.4.  I'm not sure that I see the value in
changing the back-branches and then having to deal with the well, if
you're on 9.3.5 then X, but if you're on 9.3.6 then Y or having to
figure out how to deal with the documentation for this.

Has there been any thought as to what pg_upgrade should do..?

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Log notice that checkpoint is to be written on shutdown

2014-10-02 Thread David G Johnston
Michael Banck-2 wrote
 Hi,
 
 we have seen repeatedly that users can be confused about why PostgreSQL
 is not shutting down even though they requested it.  Usually, this is
 because `log_checkpoints' is not enabled and the final checkpoint is
 being written, delaying shutdown. As no message besides shutting down
 is written to the server log in this case, we even had users believing
 the server was hanging and pondering killing it manually.
 
 In order to alert those users that a checkpoint is being written, I
 propose to add a log message waiting for checkpoint ... on shutdown,
 even if log_checkpoints is disabled, as this particular checkpoint might
 be important information.
 
 I've attached a trivial patch for this, should it be added to the next
 commitfest?

Peeking at this provokes a couple of novice questions:

While apparently it is impossible to have a checkpoint to log at recovery
end, the decision to use the bitwise-OR instead of the local shutdown bool
seemed odd at first...

LogCheckpointStart creates the log entry unconditionally - leaving the
caller responsible for evaluating log_checkpoints - while LogCheckpointEnd
has the log_checkpoints condition built into it.  I mention this because by
the same argument advocating the logging of the checkpoint start we should
also log its end.  In order to do that here, though, we have to change
log_checkpoints before calling LogCheckpointEnd.  Now, since we're going to
shut down anyway that seems benign (and forecloses on any need to set it back
to the prior value) but it's still ugly.

Just some thoughts...

The rationale makes perfect sense.  +1

David J.




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Log-notice-that-checkpoint-is-to-be-written-on-shutdown-tp5821394p5821417.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-10-02 Thread Andres Freund
On 2014-10-02 10:55:06 -0400, Robert Haas wrote:
 On Thu, Oct 2, 2014 at 10:34 AM, Andres Freund and...@2ndquadrant.com wrote:
  It's actually more complex than that :(
 
  Simple things first:
 
  Oracle's definition seems pretty iron clad:
  http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
  __machine_acq_barrier is a clear superset of __machine_r_barrier and
  __machine_rel_barrier is a clear superset of __machine_w_barrier
 
  And that's what we're essentially discussing, no? That said, there seems
  to be no reason to avoid using __machine_r/w_barrier().
 
 So let's use those, then.

Right, I've never contended that.

  But for the reason why I defined pg_read_barrier/write_barrier to
  __atomic_thread_fence(__ATOMIC_ACQUIRE/RELEASE):
 
  The C11/C++11 definition it's made for is hellishly hard to
  understand. There's very subtle differences between acquire/release
  operation and acquire/release fences. 29.8.2/7.17.4 seems to be the relevant
  parts of the standards. I think it essentially guarantees the mapping
  we're talking about, but it's not entirely clear.
 
  The way acquire/release fences are defined is that they form a
  'synchronizes-with' relationship with each other. Which would, I think,
  be sufficient given that without a release like operation on the other
  thread a read/wrie barrier isn't worth much. But there's a rub in that
  it requires a atomic operation involved somehere to give that guarantee.
 
  I *did* check that the emitted code on relevant architectures is sane,
  but that doesn't guarantee anything for the future.
 
  Therefore I'm proposing to replace it with __ATOMIC_ACQ_REL which is
  definitely guaranteeing what we need, even if superflously heavy on some
  platforms. It still is significantly more efficient than
  __sync_synchronize() which is what was used before. I.e. it generates no
  code on x86 (MFENCE otherwise), and only a lwsync on PPC (hwsync
  otherwise, although I don't know why) and similar on ia64.
 
 A fully barrier on x86 should be an mfence, right?

Right. I've not talked about changing full barrier semantics. What I was
referring to is that until the atomics patch we always redefine
read/write barriers to be full barriers when using gcc intrinsics.

 With only a compiler barrier, you have loads ordered with respect to
 loads and stores ordered with respect to stores, but the load/store
 ordering isn't fully defined.

Yes.

  Which is why these acquire/release fences, in contrast to
  acquire/release operations, have more guarantees... You put your finger
  right onto the spot.
 
 But, uh, we still don't seem to know what those guarantees actually ARE.

Paired together they form a synchronized-with relationship. Problem #1
is that the standard's language isn't, to me at least, clear if there's
not some case where that's not the case. Problem #2 is that our current
README.barrier definition doesn't actually require barriers to be
paired. Which imo is bad, but still a fact.

The definition of ACQ_REL is pretty clearly sufficient imo: "Full
barrier in both directions and synchronizes with acquire loads and
release stores in another thread".

  Say I want to appear to only change things while flag is 1, so I
  write this code:
 
  flag = 1
  acquire barrier
  things++
  release barrier
  flag = 0
 
  With the definition you (and Oracle) propose
  this won't work, because
  there's nothing to keep the modification of things from being
  reordered before flag = 1.  What good is that?  Apparently, I don't
  have any idea!
 
  As written above, I don't think that applies to oracle's definition?
 
 Oracle's definition doesn't look sufficient there.

Perhaps I'm just not understanding what you want to show with this
example. This started as a discussion of comparing acquire/release with
read/write barriers, right? Or are you generally wondering about the
point acquire/release barriers?

 The acquire
 barrier guarantees that the load operations before the barrier will be
 completed before the load and store operations after the barrier, but
 the only operation before the barrier is a store, not a load, so it
 guarantees nothing.

Well, 'acquire' operations always have to be related to a load. That's why
standalone 'acquire fences' or 'acquire barriers' are more heavyweight
than just an acquiring read.

And realistically, in the above example, you'd have to read flag to see
that it's not already 1, right?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Andres Freund
On 2014-10-02 10:56:05 -0400, Robert Haas wrote:
 On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-10-02 10:40:30 -0400, Robert Haas wrote:
  On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
   OK.
  
   Given that the results look good, do you plan to push this?
 
  By this, you mean the increase in the number of buffer mapping
  partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?
 
  Yes. Now that I think about it I wonder if we shouldn't define 
  MAX_SIMUL_LWLOCKS like
  #define MAX_SIMUL_LWLOCKS   (NUM_BUFFER_PARTITIONS + 64)
  or something like that?
 
 Nah.  That assumes NUM_BUFFER_PARTITIONS will always be the biggest
 thing, and I don't see any reason to assume that, even if we're making
 it true for now.

The reason I'm suggesting it is that NUM_BUFFER_PARTITIONS (and
NUM_LOCK_PARTITIONS) are the cases where we can expect many lwlocks to
be held at the same time. It doesn't seem friendly to require users
experimenting with changing this to know about a define that's private
to lwlock.c.
But I'm fine with doing this not at all or separately - there's no need
to actually do it together. 

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 11:18 AM, Andres Freund and...@2ndquadrant.com wrote:
 So let's use those, then.

 Right, I've never contended that.

OK, cool.

 A fully barrier on x86 should be an mfence, right?

 Right. I've not talked about changing full barrier semantics. What I was
 referring to is that until the atomics patch we always redefine
 read/write barriers to be full barriers when using gcc intrinsics.

OK, got it.  If there's a cheaper way to tell gcc loads before loads
or stores before stores, I'm fine with doing that for those cases.

  Which is why these acquire/release fences, in contrast to
  acquire/release operations, have more guarantees... You put your finger
  right onto the spot.

 But, uh, we still don't seem to know what those guarantees actually ARE.

 Paired together they form a synchronized-with relationship. Problem #1
 is that the standard's language isn't, to me at least, clear if there's
 not some case where that's not the case. Problem #2 is that our current
 README.barrier definition doesn't actually require barriers to be
 paired. Which imo is bad, but still a fact.

I don't know what a synchronized-with relationship means.

Also, I pretty much designed those definitions to match what Linux
does.  And it doesn't require that either, though it says that in most
cases it will work out that way.

 The definition of ACQ_REL is pretty clearly sufficient imo: Full
 barrier in both directions and synchronizes with acquire loads and
 release stores in another thread..

I dunno.  What's an acquire load?  What's a release store?  I know
what loads and stores are; I don't know what the adjectives mean.

 The acquire
 barrier guarantees that the load operations before the barrier will be
 completed before the load and store operations after the barrier, but
 the only operation before the barrier is a store, not a load, so it
 guarantees nothing.

 Well, 'acquire' operations always have to related to a load.That's why
 standalone 'acquire fences' or 'acquire barriers' are more heavyweight
 than just a acquiring read.

Again, I can't judge any of this, because you haven't defined the
terms anywhere.

 And realistically, in the above example, you'd have to read flag to see
 that it's not already 1, right?

Not necessarily.  You could be the only writer.  Think about the way
the backend entries in the stats system work.  The point of setting
the flag may be for other people to know whether the data is in the
middle of being modified.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-10-02 Thread Alvaro Herrera
Stephen Frost wrote:
 * Robert Haas (robertmh...@gmail.com) wrote:

  I agree with both of those arguments.  I have run into very few
  customers who have used the autovacuum settings to customize behavior
  for particular tables, and anyone who hasn't should see no change
  (right?), so my guess is that the practical impact of the change will
  be pretty limited.  On the other hand, it's a clear behavior change.
  Someone could have set the per-table limit to something enormous and
  never suffered from that setting because it has basically no effect as
  things stand right now today; and that person might get an unpleasant
  surprise when they update.
  
  I would at least back-patch it to 9.4.  I could go either way on
  whether to back-patch into older branches.  I lean mildly in favor of
  it at the moment, but with considerable trepidation.
 
 I'm fine with putting it into 9.4.  I'm not sure that I see the value in
 changing the back-branches and then having to deal with the well, if
 you're on 9.3.5 then X, but if you're on 9.3.6 then Y or having to
 figure out how to deal with the documentation for this.

Well, the value obviously is that we would fix the bug that Mark
Kirkwood reported that started this thread.

Basically, if you are on 9.3.5 or earlier any per-table options for
autovacuum cost delay will misbehave (meaning: any such table will be
processed with settings flattened according to balancing of the standard
options, _not_ the configured ones).  If you are on 9.3.6 or newer they
will behave as described in the docs.

 Has there been any thought as to what pg_upgrade should do..?

Yes, I'm thinking there's nothing it should do.

-- 
Álvaro Herrera      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Log notice that checkpoint is to be written on shutdown

2014-10-02 Thread Michael Banck
Hi,

Am Donnerstag, den 02.10.2014, 08:17 -0700 schrieb David G Johnston:
 Michael Banck-2 wrote 
  I've attached a trivial patch for this, should it be added to the next
  commitfest?
 
 Peeking at this provokes a couple of novice questions:
 
 While apparently it is impossible to have a checkpoint to log at recovery
 end the decision to use the bitwise-OR instead of the local shutdown bool
 seemed odd at first...

Good point, using `shutdown' is probably better.  I guess that local
bool had escaped my notice when I first had a look at this a while ago.

 LogCheckpointStart creates the log entry unconditionally - leaving the
 caller responsible for evaluating log_checkpoints - while LogCheckpointEnd
 has the log_checkpoints condition built into it. 

I noticed this as well.  My guess would be that this is because
LogCheckpointEnd() also updates the BgWriterStats statistics, and should
do that every time, not just when the checkpoint is getting logged.
Whether or not log_checkpoints is set (and logging should happen) is
evaluated directly after that. 

Some refactoring of this might be useful (or there might be a very good
reason why this was done like this), but this seems outside the scope of
this patch.  AIUI, shutdown will not be further delayed after the
checkpoint is finished, so no further logging is required for the
purpose of this patch.

 The rationale makes perfect sense.  +1

Cool!

I have attached an updated patch using the `shutdown' bool.


Michael

-- 
Michael Banck
Projektleiter / Berater
Tel.: +49 (2161) 4643-171
Fax:  +49 (2161) 4643-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Hohenzollernstr. 133, 41061 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5a4dbb9..f2716ae 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8085,10 +8085,14 @@ CreateCheckPoint(int flags)
 
 	/*
 	 * If enabled, log checkpoint start.  We postpone this until now so as not
-	 * to log anything if we decided to skip the checkpoint.
+	 * to log anything if we decided to skip the checkpoint.  If we are shutting
+	 * down and checkpoints are not being logged, add a log message that a
+	 * checkpoint is to be written and shutdown is potentially delayed.
 	 */
 	if (log_checkpoints)
 		LogCheckpointStart(flags, false);
+	else if (shutdown)
+		ereport(LOG, (errmsg("waiting for checkpoint ...")));
 
 	TRACE_POSTGRESQL_CHECKPOINT_START(flags);
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] DDL Damage Assessment

2014-10-02 Thread Dimitri Fontaine
Hi fellow hackers,

I would like to work on a new feature allowing our users to assess the
amount of trouble they will run into when running a DDL script on their
production setups, *before* actually getting their services down.

The main practical example I can offer here is the ALTER TABLE command.
Recent releases are including very nice optimisations to it, so much so
that it's becoming increasingly hard to answer some very basic
questions:

  - what kind of locks will be taken? (exclusive, shared)
  - on what objects? (foreign keys, indexes, sequences, etc)
  - will the table have to be rewritten? the indexes?

Of course the docs are answering parts of those, but in particular the
table rewriting rules are complex enough that “accidental DBAs” will
fail to predict if the target data type is binary coercible to the
current one.

Questions:

 1. Do you agree that a systematic way to report what a DDL command (or
script, or transaction) is going to do on your production database
is a feature we should provide to our growing user base?

 2. What do you think such a feature should look like?

 3. Does it make sense to support the whole set of DDL commands from the
get go (or ever) when most of them are only taking locks in their
own pg_catalog entry anyway?

Provided that we are able to converge towards a common enough answer to
those questions, I propose to hack my way around and send patches to
have it (the common answer) available in the next PostgreSQL release.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Fabrízio de Royes Mello
On Thu, Oct 2, 2014 at 1:30 PM, Dimitri Fontaine dimi...@2ndquadrant.fr
wrote:

 Hi fellow hackers,

 I would like to work on a new feature allowing our users to assess the
 amount of trouble they will run into when running a DDL script on their
 production setups, *before* actually getting their services down.

 The main practical example I can offer here is the ALTER TABLE command.
 Recent releases are including very nice optimisations to it, so much so
 that it's becoming increasingly hard to answer some very basic
 questions:

   - what kind of locks will be taken? (exclusive, shared)
   - on what objects? (foreign keys, indexes, sequences, etc)
   - will the table have to be rewritten? the indexes?

 Of course the docs are answering parts of those, but in particular the
 table rewriting rules are complex enough that “accidental DBAs” will
 fail to predict if the target data type is binary coercible to the
 current one.

 Questions:

  1. Do you agree that a systematic way to report what a DDL command (or
 script, or transaction) is going to do on your production database
 is a feature we should provide to our growing user base?

  2. What do you think such a feature should look like?

  3. Does it make sense to support the whole set of DDL commands from the
 get go (or ever) when most of them are only taking locks in their
 own pg_catalog entry anyway?

 Provided that we are able to converge towards a common enough answer to
 those questions, I propose to hack my way around and send patches to
 have it (the common answer) available in the next PostgreSQL release.


What you are proposing is some kind of dry-run with verbose output?

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Claudio Freire
On Thu, Oct 2, 2014 at 1:46 PM, Fabrízio de Royes Mello
fabriziome...@gmail.com wrote:
 On Thu, Oct 2, 2014 at 1:30 PM, Dimitri Fontaine dimi...@2ndquadrant.fr
 wrote:

 Hi fellow hackers,

 I would like to work on a new feature allowing our users to assess the
 amount of trouble they will run into when running a DDL script on their
 production setups, *before* actually getting their services down.

 The main practical example I can offer here is the ALTER TABLE command.
 Recent releases are including very nice optimisations to it, so much so
 that it's becoming increasingly hard to answer some very basic
 questions:

   - what kind of locks will be taken? (exclusive, shared)
   - on what objects? (foreign keys, indexes, sequences, etc)
   - will the table have to be rewritten? the indexes?

 Of course the docs are answering parts of those, but in particular the
 table rewriting rules are complex enough that “accidental DBAs” will
 fail to predict if the target data type is binary coercible to the
 current one.

 Questions:

  1. Do you agree that a systematic way to report what a DDL command (or
 script, or transaction) is going to do on your production database
 is a feature we should provide to our growing user base?

  2. What do you think such a feature should look like?

  3. Does it make sense to support the whole set of DDL commands from the
 get go (or ever) when most of them are only taking locks in their
 own pg_catalog entry anyway?

 Provided that we are able to converge towards a common enough answer to
 those questions, I propose to hack my way around and send patches to
 have it (the common answer) available in the next PostgreSQL release.


 What you are proposing is some kind of dry-run with verbose output?


EXPLAIN ALTER TABLE ?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Harold Giménez
I think the main issue is when a table rewrite is triggered on a DDL
command on a large table, as this is what frequently leads to
unavailability. The idea of introducing a NOREWRITE keyword to DDL
commands then came up (credit: Peter Geoghegan). When the NOREWRITE
keyword is used and the DDL statement would rewrite the table, the
command errors and exits.

This would allow ORM and framework authors to include the NOREWRITE
option by default, only to be disabled on a per-statement basis by the
developer, once they have assessed that it may be safe or otherwise
they still want to proceed with this. The workflow for an app
developer then becomes:

* Write offending data migration (eg: add a column with a NOT NULL
constraint and default value)
* Test it locally, either by running automated test suite or running on staging
* See that it fails because of NOREWRITE option
* Assess situation. If it's a small table, or I still want to ignore,
override the option. Or rewrite migration to avoid rewrite.
* Repeat

I like this a lot just because it's simple, limited in scope, and can
be easily integrated into ORMs saving users hours of downtime and
frustration.

Thoughts?

On Thu, Oct 2, 2014 at 9:46 AM, Fabrízio de Royes Mello
fabriziome...@gmail.com wrote:


 On Thu, Oct 2, 2014 at 1:30 PM, Dimitri Fontaine dimi...@2ndquadrant.fr
 wrote:

 Hi fellow hackers,

 I would like to work on a new feature allowing our users to assess the
 amount of trouble they will run into when running a DDL script on their
 production setups, *before* actually getting their services down.

 The main practical example I can offer here is the ALTER TABLE command.
 Recent releases are including very nice optimisations to it, so much so
 that it's becoming increasingly hard to answer some very basic
 questions:

   - what kind of locks will be taken? (exclusive, shared)
   - on what objects? (foreign keys, indexes, sequences, etc)
   - will the table have to be rewritten? the indexes?

 Of course the docs are answering parts of those, but in particular the
 table rewriting rules are complex enough that “accidental DBAs” will
 fail to predict if the target data type is binary coercible to the
 current one.

 Questions:

  1. Do you agree that a systematic way to report what a DDL command (or
 script, or transaction) is going to do on your production database
 is a feature we should provide to our growing user base?

  2. What do you think such a feature should look like?

  3. Does it make sense to support the whole set of DDL commands from the
 get go (or ever) when most of them are only taking locks in their
 own pg_catalog entry anyway?

 Provided that we are able to converge towards a common enough answer to
 those questions, I propose to hack my way around and send patches to
 have it (the common answer) available in the next PostgreSQL release.


 What you are proposing is some kind of dry-run with verbose output?

 Regards,

 --
 Fabrízio de Royes Mello
 Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Heikki Linnakangas

On 10/02/2014 05:40 PM, Robert Haas wrote:

On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com wrote:

OK.


Given that the results look good, do you plan to push this?


By this, you mean the increase in the number of buffer mapping
partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?


Hmm, do we actually ever need to hold all the buffer partition locks at 
the same time? At a quick search for NUM_BUFFER_PARTITIONS in the code, 
I couldn't find any place where we'd do that. I bumped up 
NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did 
make check. It passed.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Andres Freund
On 2014-10-02 20:04:58 +0300, Heikki Linnakangas wrote:
 On 10/02/2014 05:40 PM, Robert Haas wrote:
 On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 OK.
 
 Given that the results look good, do you plan to push this?
 
 By this, you mean the increase in the number of buffer mapping
 partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?
 
 Hmm, do we actually ever need to hold all the buffer partition locks at the
 same time? At a quick search for NUM_BUFFER_PARTITIONS in the code, I
 couldn't find any place where we'd do that. I bumped up
 NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did
 make check. It passed.

Do a make check-world and it'll hopefully fail ;). Check
pg_buffercache_pages.c.

I'd actually quite like to have a pg_buffercache version that, at least
optionally, doesn't do this, but that's a separate thing.
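
For anyone else following along, the pg_buffercache pattern in question holds
every buffer mapping partition lock at once while it snapshots the buffer
headers, roughly like this (paraphrased sketch; the accessor name varies
between versions):

#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/lwlock.h"

static void
lock_all_buffer_partitions(void)
{
	int			i;

	/* grab every partition lock in share mode before scanning the buffers */
	for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
		LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED);
}

static void
unlock_all_buffer_partitions(void)
{
	int			i;

	/* release in reverse order, matching pg_buffercache's style */
	for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
		LWLockRelease(BufMappingPartitionLockByIndex(i));
}

That is why MAX_SIMUL_LWLOCKS has to be at least NUM_BUFFER_PARTITIONS.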

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Joe Conway
On 10/02/2014 11:30 AM, Dimitri Fontaine wrote:
 Questions:
 
 1. Do you agree that a systematic way to report what a DDL command
 (or script, or transaction) is going to do on your production
 database is a feature we should provide to our growing user base?

+1

I really like the idea and would find it useful/time-saving

 2. What do you think such a feature should look like?

Elsewhere on this thread EXPLAIN was suggested. That makes a certain
amount of sense.

Maybe something like EXPLAIN IMPACT [...]

 3. Does it make sense to support the whole set of DDL commands from
 the get go (or ever) when most of them are only taking locks in
 their own pg_catalog entry anyway?

Yes, I think it should cover all commands that can have an
availability impact.

Joe


-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-10-02 Thread Andres Freund
On 2014-10-02 11:35:32 -0400, Robert Haas wrote:
 On Thu, Oct 2, 2014 at 11:18 AM, Andres Freund and...@2ndquadrant.com wrote:
   Which is why these acquire/release fences, in contrast to
   acquire/release operations, have more guarantees... You put your finger
   right onto the spot.
 
  But, uh, we still don't seem to know what those guarantees actually ARE.
 
  Paired together they form a synchronized-with relationship. Problem #1
  is that the standard's language isn't, to me at least, clear if there's
  not some case where that's not the case. Problem #2 is that our current
  README.barrier definition doesn't actually require barriers to be
  paired. Which imo is bad, but still a fact.
 
 I don't know what a synchronized-with relationship means.

I'm using the standard's language here, given that I'm trying to reason
about its behaviour...

What it means is that if you have a matching pair of acquire/release
operations or barriers/fences everything that happened *before* the last
release fence will be visible *after* executing the next acquire
operation in a different thread-of-execution. And 'after' is defined in
the way that is true if the 'acquiring' thread can see the result of the
'releasing' operation.
I.e. no loads after the acquire can see values from before the release.

My problem with the definition in the standard is that it's not
particularly clear how acquire fences *without* an underlying explicit
atomic operation are defined in the standard.

I checked gcc's current code and it's fine in that regard. Also other
popular concurrent open source stuff like
http://git.qemu.org/?p=qemu.git;a=blob;f=include/qemu/atomic.h;hb=HEAD
does precisely what I'm talking about:

100 #ifndef smp_wmb
101 #ifdef __ATOMIC_RELEASE
102 #define smp_wmb()   __atomic_thread_fence(__ATOMIC_RELEASE)
103 #else
104 #define smp_wmb()   __sync_synchronize()
105 #endif
106 #endif
107
108 #ifndef smp_rmb
109 #ifdef __ATOMIC_ACQUIRE
110 #define smp_rmb()   __atomic_thread_fence(__ATOMIC_ACQUIRE)
111 #else
112 #define smp_rmb()   __sync_synchronize()
113 #endif
114 #endif

The commit that added it
http://git.qemu.org/?p=qemu.git;a=commitdiff;h=5444e768ee1abe6e021bece19a9a932351f88c88
was written by one gcc guy and reviewed by another one...

So I think we can be pretty sure that gcc's __atomic_thread_fence()
behaves like we want. We probably have to be a bit more careful about
extending that definition (by including atomic.h and doing
atomic_thread_fence(memory_order_acquire)) to use general C11. Which is
probably a couple years away anyway.

 Also, I pretty much designed those definitions to match what Linux
 does.  And it doesn't require that either, though it says that in most
 cases it will work out that way.

My point is that read barriers aren't particularly meaningful
without a defined store order from another thread/process. Without any
form of pairing you don't have that. The writing side could just have
reordered the writes in a way you didn't want them.  And the kernel docs
do say "A lack of appropriate pairing is almost certainly an error". But
since read barriers also pair with lock releases operations, that's
normally not a big problem.

  The definition of ACQ_REL is pretty clearly sufficient imo: Full
  barrier in both directions and synchronizes with acquire loads and
  release stores in another thread..
 
 I dunno.  What's an acquire load?  What's a release store?  I know
 what loads and stores are; I don't know what the adjectives mean.

An acquire load is either an explicit atomic load (tas, cmpxchg, etc
also count) or a normal load combined with an acquire barrier. The symmetric
definition is true for release store.

(so, on x86 every load/store that prevents compiler reordering is
essentially an acquire/release store)

  And realistically, in the above example, you'd have to read flag to see
  that it's not already 1, right?
 
 Not necessarily.  You could be the only writer.  Think about the way
 the backend entries in the stats system work.  The point of setting
 the flag may be for other people to know whether the data is in the
 middle of being modified.

So you're thinking about something seqlock-like... Isn't the problem
then that you actually don't want acquire semantics, but release or
write barrier semantics on that store? The acquire/read barrier part
would be on the reader side, no?
I'm still unsure what you want to show with that example?
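
A minimal sketch of the seqlock-ish pairing being described, assuming a
hypothetical SharedSlot struct and using the existing pg_write_barrier()/
pg_read_barrier() macros (write barriers on the writing side, read barriers
on the reading side):

#include "postgres.h"
#include "storage/barrier.h"

typedef struct
{
	volatile int	changecount;	/* odd while an update is in progress */
	int				value;
} SharedSlot;						/* hypothetical shared state */

static void
writer_update(SharedSlot *slot, int newval)
{
	slot->changecount++;	/* now odd: update in progress */
	pg_write_barrier();		/* order the count store before the data store */
	slot->value = newval;
	pg_write_barrier();		/* order the data store before the final count store */
	slot->changecount++;	/* even again: update complete */
}

static bool
reader_read(SharedSlot *slot, int *out)
{
	int			before = slot->changecount;

	pg_read_barrier();		/* order the first count load before the data load */
	*out = slot->value;
	pg_read_barrier();		/* order the data load before the second count load */
	return (before % 2 == 0) && before == slot->changecount;
}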

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Joshua D. Drake


On 10/02/2014 09:30 AM, Dimitri Fontaine wrote:


Questions:

  1. Do you agree that a systematic way to report what a DDL command (or
 script, or transaction) is going to do on your production database
 is a feature we should provide to our growing user base?


I would say it is late to the game and a great feature.



  2. What do you think such a feature should look like?



I liked the other post that said: EXPLAIN ALTER TABLE or whatever.
Heck, it could even be useful to have EXPLAIN ANALYZE ALTER TABLE in
case people want to run it on staging/test/dev environments to judge impact.



  3. Does it make sense to support the whole set of DDL commands from the
 get go (or ever) when most of them are only taking locks in their
 own pg_catalog entry anyway?


I would think that introducing this incrementally makes sense.

JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] TAP test breakage on MacOS X

2014-10-02 Thread Robert Haas
make check-world dies ingloriously for me, like this:

/bin/sh ../../../config/install-sh -c -d tmp_check/log
make -C ../../..
DESTDIR='/Users/rhaas/pgsql/src/bin/initdb'/tmp_check/install install
'/Users/rhaas/pgsql/src/bin/initdb'/tmp_check/log/install.log 21
cd .  TESTDIR='/Users/rhaas/pgsql/src/bin/initdb'
PATH=/Users/rhaas/pgsql/src/bin/initdb/tmp_check/install/Users/rhaas/project/bin:$PATH
DYLD_LIBRARY_PATH='/Users/rhaas/pgsql/src/bin/initdb/tmp_check/install/Users/rhaas/project/lib'
PGPORT='65432' prove --ext='.pl' -I ../../../src/test/perl/ --verbose
t/
t/001_initdb.pl ..
1..14
1..3
ok 1 - initdb --help exit code 0
ok 2 - initdb --help goes to stdout
ok 3 - initdb --help nothing to stderr
ok 1 - initdb --help
1..3
ok 1 - initdb --version exit code 0
ok 2 - initdb --version goes to stdout
ok 3 - initdb --version nothing to stderr
ok 2 - initdb --version
1..2
ok 1 - initdb with invalid option nonzero exit code
ok 2 - initdb with invalid option prints error message
# Looks like your test exited with 256 just after 2.
not ok 3 - initdb options handling

#   Failed test 'initdb options handling'
#   at /opt/local/lib/perl5/5.12.5/Test/Builder.pm line 229.
ok 4 - basic initdb
ok 5 - existing data directory
ok 6 - nosync
ok 7 - sync only
ok 8 - sync missing data directory
ok 9 - existing empty data directory
ok 10 - separate xlog directory
ok 11 - relative xlog directory not allowed
ok 12 - existing empty xlog directory
ok 13 - existing nonempty xlog directory
ok 14 - select default dictionary
# Looks like you failed 1 test of 14.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/14 subtests

Test Summary Report
---
t/001_initdb.pl (Wstat: 256 Tests: 14 Failed: 1)
  Failed test:  3
  Non-zero exit status: 1
Files=1, Tests=14, 23 wallclock secs ( 0.02 usr  0.01 sys +  9.57 cusr
 3.37 csys = 12.97 CPU)
Result: FAIL
make[2]: *** [check] Error 1
make[1]: *** [check-initdb-recurse] Error 2
make: *** [check-world-src/bin-recurse] Error 2

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 1:07 PM, Andres Freund and...@2ndquadrant.com wrote:
 Do a make check-world and it'll hopefully fail ;). Check
 pg_buffercache_pages.c.

Yep.  Committed, with an update to the comments in lwlock.c to allude
to the pg_buffercache issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proper query implementation for Postgresql driver

2014-10-02 Thread Robert Haas
On Tue, Sep 30, 2014 at 1:20 AM, Craig Ringer cr...@2ndquadrant.com wrote:
 Frankly, I suggest dropping simple entirely and using only the
 parse/bind/describe/execute flow in the v3 protocol.

The last time I checked, that was significantly slower.

http://www.postgresql.org/message-id/ca+tgmoyjkfnmrtmhodwhnoj1jwcgzs_h1r70ercecrwjm65...@mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Stephen Frost
* Dimitri Fontaine (dimi...@2ndquadrant.fr) wrote:
  1. Do you agree that a systematic way to report what a DDL command (or
 script, or transaction) is going to do on your production database
 is a feature we should provide to our growing user base?

I definitely like the idea of such a 'dry-run' kind of operation to get
an idea of what would happen.

  2. What do you think such a feature should look like?

My thinking is that this would be implemented as a new kind of read-only
transaction type.

  3. Does it make sense to support the whole set of DDL commands from the
 get go (or ever) when most of them are only taking locks in their
 own pg_catalog entry anyway?

On the fence about this one..  In general, I'd say yes, but I've not
looked at every case and I imagine there are DDL commands which really
aren't all that interesting for this case.

 Provided that we are able to converge towards a common enough answer to
 those questions, I propose to hack my way around and send patches to
 have it (the common answer) available in the next PostgreSQL release.

That feels a bit ambitious, given that we've not really nailed down
the feature definition yet, but I do like where you're going. :)

Thanks!

Stephen




Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Stephen Frost
* Harold Giménez (har...@heroku.com) wrote:
 I think the main issue is when a table rewrite is triggered on a DDL
 command on a large table, as this is what frequently leads to
 unavailability. The idea of introducing a NOREWRITE keyword to DDL
 commands then came up (credit: Peter Geoghegan). When the NOREWRITE
 keyword is used and the DDL statement would rewrite the table, the
 command errors and exits.
 
 This would allow ORM and framework authors to include the NOREWRITE
 option by default, only to be disabled on a per-statement basis by the
 developer, once they have assessed that it may be safe or otherwise
 they still want to proceed with this. The workflow for an app
 developer then becomes:
 
 * Write offending data migration (eg: add a column with a NOT NULL
 constraint and default value)
 * Test it locally, either by running automated test suite or running on 
 staging
 * See that it fails because of NOREWRITE option
 * Assess situation. If it's a small table, or I still want to ignore,
 override the option. Or rewrite migration to avoid rewrite.
 * Repeat
 
 I like this a lot just because it's simple, limited in scope, and can
 be easily integrated into ORMs saving users hours of downtime and
 frustration.
 
 Thoughts?

Not against it, but feels like an independent thing to consider- what
Devrim is suggesting is broader and encompasses the issue of locks,
which are certainly important to consider also.

In short, seems like having both would be worthwhile.

Thanks,

Stephen




Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Stephen Frost
* Joshua D. Drake (j...@commandprompt.com) wrote:
   2. What do you think such a feature should look like?
 
 I liked the other post that said: EXPLAIN ALTER TABLE or whatever.
 Heck it could even be useful to have EXPLAIN ANALZYE ALTER TABLE
 in case people want to run it on staging/test/dev environments to
 judge impact.

The downside of the 'explain' approach is that the script then has to be
modified to put 'explain' in front of everything and then you have to go
through each statement and consider it.  Having a 'dry-run' transaction
type which then produces a report at the end feels like it'd be both
easier to assess the overall implications, and less error-prone as you
don't have to prefix every statement with 'explain'.  It might even be
possible to have the local view of post-alter statements be available
inside of this 'dry-run' option - that is, if you add a column in the
transaction then the column exists for the following commands, so it
doesn't just error out.  Having 'explain whatever' wouldn't give you
that, and so you really wouldn't be able to have whole scripts run by
just pre-pending each command with 'explain'.
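
To make that concrete (hypothetical syntax - no such EXPLAIN variant for
ALTER exists today - and reusing the "emp" example table from elsewhere
in this thread):

EXPLAIN ALTER TABLE emp ADD COLUMN foo2 jsonb;
EXPLAIN UPDATE emp SET foo2 = '{}';  -- fails: the first EXPLAIN never
                                     -- actually added the column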

   3. Does it make sense to support the whole set of DDL commands from the
  get go (or ever) when most of them are only taking locks in their
  own pg_catalog entry anyway?
 
 I would think that introducing this incrementally makes sense.

Agreed.

Thanks,

Stephen




Re: [HACKERS] NEXT VALUE FOR sequence

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 7:27 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 SQL:2003 introduced the function NEXT VALUE FOR sequence. Google
 tells me that at least DB2, SQL Server and a few niche databases
 understand it so far.  As far as I can tell there is no standardised
 equivalent of currval and setval (but I only have access to second
 hand information about the standard, like articles and the manuals of
 other products).

 Here is a starter patch to add it.  To avoid a shift/reduce conflict,
 I had to reclassify the keyword NEXT.  I admit that I don't fully
 understand the consequences of that change!  Please let me know if you
 think this could fly.

 Looks correct. Of course, it's annoying to have to reserve the NEXT keyword
 (as a type_func_name_keyword, not fully reserved).

 One way to avoid that is to collapse NEXT VALUE FOR into a single token in
 parser.c. We do that for a few other word pairs: NULLS FIRST, NULLS LAST,
 WITH TIME and WITH ORDINALITY. In this case you'd need to look-ahead three
 tokens, not two, but I guess that'd be doable.

Those kinds of hacks are not scalable.  It's not too bad right now
because NULLS, FIRST, and LAST are all rarely-used keywords and
there's rarely a reason for FIRST and LAST to follow NULLS except in
the exact context we care about.  But the more we extend those hacks,
the more likely it is that the lexer will smash the tokens in some
case where the user actually meant something else.  Hacking the lexer
to get around grammar conflicts doesn't actually fix whatever
intrinsic semantic conflict exists; it just keeps bison from knowing
about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Josh Berkus

 Questions:
 
  1. Do you agree that a systematic way to report what a DDL command (or
 script, or transaction) is going to do on your production database
 is a feature we should provide to our growing user base?

Yes.

  2. What do you think such a feature should look like?

As with others, I think EXPLAIN is a good way to do this without adding
a keyword.  So you'd do:

EXPLAIN
ALTER TABLE 

... and it would produce a bunch of actions, available in either text or
JSON formats.  For example:

{ "locks" : [ { "lock_type": "relation",
                "relation": "table1",
                "lock_mode": "ACCESS EXCLUSIVE" },
              { "lock_type": "transaction" },
              { "lock_type": "catalog",
                "catalogs": ["pg_class", "pg_attribute", "pg_statistic"],
                "lock_mode": "EXCLUSIVE" } ],
  "writes" : [ { "object": "relation files",
                 "action": "rewrite" },
               { "object": "catalogs",
                 "action": "update" } ]
}

... etc.  Would need a lot of refinement, but you get the idea.

  3. Does it make sense to support the whole set of DDL commands from the
 get go (or ever) when most of them are only taking locks in their
 own pg_catalog entry anyway?

Well, eventually we'd want to support all of them just to avoid having
things be weird for users.  However, here's a priority order:

ALTER TABLE
CREATE TABLE
DROP TABLE
ALTER VIEW
CREATE VIEW
CREATE INDEX
DROP INDEX

... since all of the above can have unexpected secondary effects on
locking.  For example, if you create a table with FKs it will take an
ACCESS EXCLUSIVE lock on the FK targets.  And if you DROP a partition,
it takes an A.E. lock on the parent table.
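
For instance (table names invented; the exact lock mode reported will
depend on the server version), the secondary locking can be watched from
a second session while the first one is still open:

-- session 1
BEGIN;
CREATE TABLE orders (
    id      bigint PRIMARY KEY,
    cust_id bigint REFERENCES customers (id)  -- also locks "customers"
);

-- session 2, before session 1 commits or rolls back
SELECT c.relname, l.mode, l.granted
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
WHERE c.relname IN ('orders', 'customers');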

 Provided that we are able to converge towards a common enough answer to
 those questions, I propose to hack my way around and send patches to
 have it (the common answer) available in the next PostgreSQL release.

Great!

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPSERT wiki page, and SQL MERGE syntax

2014-10-02 Thread Peter Geoghegan
On Thu, Oct 2, 2014 at 12:54 AM, Peter Geoghegan p...@heroku.com wrote:
 I've started off by adding varied examples of the use of the existing
 proposed syntax. I'll expand on this soon.

I spent some time today expanding on the details, and commenting on
the issues around the custom syntax (exactly what it does, issues
around ambiguous conflict-on unique indexes, etc).

Also, Challenges, issues has been updated - I've added some
secondary concerns. These are non-contentious items that I think are
of concern, and would like to highlight now.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread José Luis Tallón
On 10/02/2014 06:30 PM, Dimitri Fontaine wrote:
 Hi fellow hackers,
 [snip]
 Questions:

  1. Do you agree that a systematic way to report what a DDL command (or
 script, or transaction) is going to do on your production database
 is a feature we should provide to our growing user base?

Yes, please

  2. What do you think such a feature should look like?
EXPLAIN [(verbose, format)] [DDL_COMMAND]

as in:
EXPLAIN (verbose on, format text, impact on)
ALTER TABLE emp
ADD COLUMN foo2 jsonb NOT NULL DEFAULT '{}';

where the output would include something like:

...
EXCLUSIVE LOCK ON TABLE emp;                                        // due to IMPACT on
REWRITE TABLE emp due to adding column foo2 (default='{}'::jsonb)   // due to VERBOSE on
...


 3. Does it make sense to support the whole set of DDL commands from the
 get go (or ever) when most of them are only taking locks in their
 own pg_catalog entry anyway?

For completeness sake, yes.
But, unless the impact and verbose modifiers are specified, most
would be quite self-explanatory:

EXPLAIN (verbose on, impact on) TRUNCATE TABLE emp;
Execution plan:
- EXCLUSIVE LOCK ON TABLE emp;

- truncate index: II (file=N)              // N = relfilenode
- truncate main fork: N (tablespace: T)    // N = relfilenode
- truncate visibility map

- RELEASE LOCK ON TABLE emp;

Summary: Z pages ( MMM MB ) would be freed

versus a simple:
EXPLAIN TRUNCATE TABLE emp;
Execution plan:
- truncate index: emp_pkey
- truncate index: emp_foo2_idx
- truncate relation emp


 Provided that we are able to converge towards a common enough answer to
 those questions, I propose to hack my way around and send patches to
 have it (the common answer) available in the next PostgreSQL release.


Sounds very good, indeed.
Count on me as tester :)


--
José Luis Tallón




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-02 Thread Bruce Momjian
On Tue, Sep 30, 2014 at 02:57:43PM -0700, Josh Berkus wrote:
 I don't know that that is the *expectation*.  However, I personally
 would find it *acceptable* if it meant that we could get efficient merge
 semantics on other aspects of the syntax, since my primary use for MERGE
 is bulk loading.
 
 Regardless, I don't think there's any theoretical way to support UPSERT
 without a unique constraint.  Therefore eventual support of this would
 require a full table lock.  Therefore having it use the same command as
 UPSERT with a unique constraint is a bit of a booby trap for users.
 This is a lot like the ADD COLUMN with a default rewrites the whole
 table booby trap which hundreds of our users complain about every
 month.  We don't want to add more such unexpected consequences for users.

I think if we use the MERGE command for this feature we would need to
use a non-standard keyword to specify that we want OLTP/UPSERT
functionality.  That would allow us to mostly use the MERGE standard
syntax without having surprises about non-standard behavior.  I am
thinking of how CONCURRENTLY changes the behavior of some commands.
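
Just to sketch the idea (nothing like this is implemented, and the extra
keyword is purely illustrative), the standard MERGE shape plus a
concurrency-changing modifier might look like:

MERGE CONCURRENTLY INTO tab t
USING (VALUES (1, 'x')) AS v(id, val) ON t.id = v.id
WHEN MATCHED THEN UPDATE SET val = v.val
WHEN NOT MATCHED THEN INSERT (id, val) VALUES (v.id, v.val);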

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Steven Lembark

 EXPLAIN ALTER TABLE ?

Good thing: People recognize it.
Bad thing:  People might not be able to tell the difference between
a DDL and DML result.

What about EXPLAIN DDL ...?

The extra keyword (DDL) makes it a bit more explicit that the 
results are not comparable to the standard explain output.

-- 
Steven Lembark 3646 Flora Pl
Workhorse Computing   St Louis, MO 63110
lemb...@wrkhors.com  +1 888 359 3508


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PL/pgSQL 2

2014-10-02 Thread Steven Lembark
On Mon, 01 Sep 2014 12:00:48 +0200
Marko Tiikkaja ma...@joh.to wrote:

 create a new language.

There are enough problems with SQL in general, enough alternatives
proposed over time that it might be worth coming up with something
that Just Works.

-- 
Steven Lembark 3646 Flora Pl
Workhorse Computing   St Louis, MO 63110
lemb...@wrkhors.com  +1 888 359 3508


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PL/pgSQL 2

2014-10-02 Thread Steven Lembark

 Python2 - Python3 would've been a lot less painful if you could mark,
 on a module-by-module basis, whether a module was python2 or python3
 code. It wasn't very practical for Python because python code can reach
 deep into the guts of unrelated objects discovered at runtime  - it can
 add/replace member functions, even hot-patch bytecode. That's not
 something we allow in PL/PgSQL, though; from the outside a PL/PgSQL
 function is pretty opaque to callers.

Perl does this with "use VERSION". Currently this guarantees that
the compiler is a minimum version and also turns OFF later versions'
keywords. 

At that point someone could turn on/off the appropriate syntax
by module or code block. If you never turn on v2.0 you never get the
new behavior; after that people can adjust the amount and location 
of later code to their own taste.

-- 
Steven Lembark 3646 Flora Pl
Workhorse Computing   St Louis, MO 63110
lemb...@wrkhors.com  +1 888 359 3508


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Claudio Freire
On Thu, Oct 2, 2014 at 4:40 PM, Stephen Frost sfr...@snowman.net wrote:
 * Joshua D. Drake (j...@commandprompt.com) wrote:
   2. What do you think such a feature should look like?

 I liked the other post that said: EXPLAIN ALTER TABLE or whatever.
 Heck it could even be useful to have EXPLAIN ANALZYE ALTER TABLE
 in case people want to run it on staging/test/dev environments to
 judge impact.

 The downside of the 'explain' approach is that the script then has to be
 modified to put 'explain' in front of everything and then you have to go
 through each statement and consider it.  Having a 'dry-run' transaction
 type which then produces a report at the end feels like it'd be both
 easier to assess the overall implications, and less error-prone as you
 don't have to prefex every statement with 'explain'.  It might even be
 possible to have the local view of post-alter statements be available
 inside of this 'dry-run' option- that is, if you add a column in the
 transaction then the column exists to the following commands, so it
 doesn't just error out.  Having 'explain whatever' wouldn't give you
 that and so you really wouldn't be able to have whole scripts run by
 just pre-pending each command with 'explain'.

That sounds extremely complex. You'd have to implement the fake
columns, foreign keys, indexes, etc on most execution nodes, the
planner, and even system views.

IMO, dry-run per se, is a BEGIN; stuff; ROLLBACK. But that still needs
locks. I don't think you can simulate the side effects without locks,
so getting the local view of changes will be extremely difficult
unless you limit the scope considerably.
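
I.e. something like this (table name invented) - the statements really
run and take their normal locks; only the effects are thrown away at the
end:

BEGIN;
ALTER TABLE accounts ADD COLUMN note text;   -- takes ACCESS EXCLUSIVE
UPDATE accounts SET note = '';               -- sees the new column
ROLLBACK;                                    -- nothing kept, but others waited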


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Peter Geoghegan
On Thu, Oct 2, 2014 at 12:40 PM, Stephen Frost sfr...@snowman.net wrote:
 The downside of the 'explain' approach is that the script then has to be
 modified to put 'explain' in front of everything and then you have to go
 through each statement and consider it.  Having a 'dry-run' transaction
 type which then produces a report at the end feels like it'd be both
 easier to assess the overall implications, and less error-prone as you
 don't have to prefex every statement with 'explain'.  It might even be
 possible to have the local view of post-alter statements be available
 inside of this 'dry-run' option- that is, if you add a column in the
 transaction then the column exists to the following commands, so it
 doesn't just error out.  Having 'explain whatever' wouldn't give you
 that and so you really wouldn't be able to have whole scripts run by
 just pre-pending each command with 'explain'.

It's kind of tricky to implement a patch to figure this out ahead of
time. Some of the actual lock acquisitions are well hidden, in terms
of how the code is structured. In other cases, it may not even be
possible to determine ahead of time exactly what locks will be taken.

As Harold mentioned, another idea along the same lines would be to
decorate DDL with a NOWAIT no locking assertion and/or no rewrite
assertion.  Basically, if this DDL (or perhaps any DDL, if this is
implemented as a GUC instead) necessitates a table rewrite (and
requires an AccessExclusiveLock), throw an error. That's the case that
most people care about.

This may not even be good enough, though. Consider:

Session 1 is a long running transaction. Maybe it's a spurious
idle-in-transaction situation, but it could also be totally
reasonable. It holds an AccessShareLock on some relation, as long
running transactions are inclined to do.

Session 2 is our migration. It needs an AccessExclusiveLock to ALTER
TABLE on the same relation (or whatever). But it doesn't need a
rewrite, which is good. It comes along and attempts to acquire the
lock, blocking on session 1.

Session 3 is an innocent bystander. It goes to query the same table in
an ordinary, routine way - a SELECT statement. Even though session 2's
lock is not granted yet, session 3 is not at liberty to skip the queue
and get its own AccessShareLock. The effect is about the same as if
session 2 did need to hold an AccessExclusiveLock for ages: read
queries block for a long time. And yet, in theory session 2's impact
on production should not be minimal, if we consider something like
EXPLAIN output.
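
In SQL terms (table name invented), the sequence is roughly:

-- session 1
BEGIN;
SELECT count(*) FROM t;          -- ACCESS SHARE, held until commit
-- session 2
ALTER TABLE t ADD COLUMN c int;  -- wants ACCESS EXCLUSIVE, waits for session 1
-- session 3
SELECT * FROM t LIMIT 1;         -- queues behind session 2's ungranted lock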

Why is NOWAIT only supported for SET TABLESPACE? I guess it's just a
particularly bad case. NOWAIT might be the wrong thing for DDL
generally.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Peter Geoghegan
On Thu, Oct 2, 2014 at 1:37 PM, Peter Geoghegan p...@heroku.com wrote:
 And yet, in theory session 2's impact
 on production should not be minimal, if we consider something like
 EXPLAIN output.


Should have been minimal, I mean.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Claudio Freire
On Thu, Oct 2, 2014 at 5:37 PM, Peter Geoghegan p...@heroku.com wrote:
 Session 3 is an innocent bystander. It goes to query the same table in
 an ordinary, routine way - a SELECT statement. Even though session 2's
 lock is not granted yet, session 3 is not at liberty to skip the queue
 and get its own AccessShareLock. The effect is about the same as if
 session 2 did need to hold an AccessExclusiveLock for ages: read
 queries block for a long time. And yet, in theory session 2's impact
 on production should not be minimal, if we consider something like
 EXPLAIN output.

The explain would show the AccessExclusiveLock, so it would be enough
for a heads-up to kill all idle-in-transaction sessions holding locks on
the target relation (if killable, or just wait).

Granted, it's something that's not easily automatable, whereas a nowait is.

However, rather than nowait, I'd prefer cancellable semantics, that
would cancel voluntarily if any other transaction requests a
conflicting lock, like autovacuum does.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Claudio Freire
On Thu, Oct 2, 2014 at 6:00 PM, Peter Geoghegan p...@heroku.com wrote:
 Granted, it's something that's not easily automatable, whereas a nowait is.

 However, rather than nowait, I'd prefer cancellable semantics, that
 would cancel voluntarily if any other transaction requests a
 conflicting lock, like autovacuum does.

 I think the problem you'll have with NOWAIT is: you have an error from
 having to wait...what now? Do you restart? I imagine this would
 frequently result in what is effectively lock starvation. Any old
 AccessShareLock-er is going to make our migration tool restart. We'll
 never finish.

I've done that manually (throw the DDL, and cancel if it takes more
than a couple of seconds) on modest but relatively busy servers with
quite some success.
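
FWIW, since 9.3 much the same thing can be expressed declaratively with
lock_timeout (the statement below is only an example):

SET lock_timeout = '2s';
ALTER TABLE emp ADD COLUMN foo2 jsonb;  -- errors out instead of queueing
                                        -- indefinitely behind other locks
RESET lock_timeout;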


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Stephen Frost
* Claudio Freire (klaussfre...@gmail.com) wrote:
 On Thu, Oct 2, 2014 at 4:40 PM, Stephen Frost sfr...@snowman.net wrote:
  The downside of the 'explain' approach is that the script then has to be
  modified to put 'explain' in front of everything and then you have to go
  through each statement and consider it.  Having a 'dry-run' transaction
  type which then produces a report at the end feels like it'd be both
  easier to assess the overall implications, and less error-prone as you
  don't have to prefex every statement with 'explain'.  It might even be
  possible to have the local view of post-alter statements be available
  inside of this 'dry-run' option- that is, if you add a column in the
  transaction then the column exists to the following commands, so it
  doesn't just error out.  Having 'explain whatever' wouldn't give you
  that and so you really wouldn't be able to have whole scripts run by
  just pre-pending each command with 'explain'.
 
 That sounds extremely complex. You'd have to implement the fake
 columns, foreign keys, indexes, etc on most execution nodes, the
 planner, and even system views.

Eh?  We have MVCC catalog access.

 IMO, dry-run per se, is a BEGIN; stuff; ROLLBACK. But that still needs
 locks. I don't think you can simulate the side effects without locks,

Why?  If you know the transaction is going to roll back and you only add
entries to the catalog which aren't visible to any other transactions
than your own, and you make sure that nothing you do actually writes
data out which is visible to other transactions..

 so getting the local view of changes will be extremely difficult
 unless you limit the scope considerably.

I agree that there may be complexities, but I'm not sure this is really
the issue..

Thanks,

Stephen




Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Peter Geoghegan
On Thu, Oct 2, 2014 at 1:52 PM, Claudio Freire klaussfre...@gmail.com wrote:
 The explain would show the AccessExclusiveLock, so it would be enough
 for a heads-up to kill all idle-in-transaction holding locks on the
 target relation (if killable, or just wait).

I think that there are very few problems with recognizing when an
AccessExclusiveLock is needed or not needed. The exceptions to the
rule that DDL needs such a lock are narrow enough that I have a hard
time believing that most people think about it, or even need to think
about it. I wish that wasn't the case, but it is.

 Granted, it's something that's not easily automatable, whereas a nowait is.

 However, rather than nowait, I'd prefer cancellable semantics, that
 would cancel voluntarily if any other transaction requests a
 conflicting lock, like autovacuum does.

I think the problem you'll have with NOWAIT is: you have an error from
having to wait...what now? Do you restart? I imagine this would
frequently result in what is effectively lock starvation. Any old
AccessShareLock-er is going to make our migration tool restart. We'll
never finish.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Claudio Freire
On Thu, Oct 2, 2014 at 6:03 PM, Stephen Frost sfr...@snowman.net wrote:
 * Claudio Freire (klaussfre...@gmail.com) wrote:
 On Thu, Oct 2, 2014 at 4:40 PM, Stephen Frost sfr...@snowman.net wrote:
  The downside of the 'explain' approach is that the script then has to be
  modified to put 'explain' in front of everything and then you have to go
  through each statement and consider it.  Having a 'dry-run' transaction
  type which then produces a report at the end feels like it'd be both
  easier to assess the overall implications, and less error-prone as you
  don't have to prefex every statement with 'explain'.  It might even be
  possible to have the local view of post-alter statements be available
  inside of this 'dry-run' option- that is, if you add a column in the
  transaction then the column exists to the following commands, so it
  doesn't just error out.  Having 'explain whatever' wouldn't give you
  that and so you really wouldn't be able to have whole scripts run by
  just pre-pending each command with 'explain'.

 That sounds extremely complex. You'd have to implement the fake
 columns, foreign keys, indexes, etc on most execution nodes, the
 planner, and even system views.

 Eh?  We have MVCC catalog access.

And that needs locks, especially if you modify the underlying filesystem layout.

 IMO, dry-run per se, is a BEGIN; stuff; ROLLBACK. But that still needs
 locks. I don't think you can simulate the side effects without locks,

 Why?  If you know the transaction is going to roll back and you only add
 entries to the catalog which aren't visible to any other transactions
 than your own, and you make sure that nothing you do actually writes
 data out which is visible to other transactions..

But that's not the scope. If you want a dry-run of table-rewriting
DDL, or DDL interspersed with DML like:

alter table blargh add foo integer;
update blargh set foo = coalesce(bar, baz);

You really cannot hope not to have to write data. The above is also
the case with defaulted columns btw.
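
For the defaulted-column case that means even a single statement forces
the writes, e.g. (same invented table):

ALTER TABLE blargh ADD COLUMN foo integer NOT NULL DEFAULT 0;
-- on the releases discussed here this rewrites every row of blargh,
-- holding an ACCESS EXCLUSIVE lock for the duration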

 so getting the local view of changes will be extremely difficult
 unless you limit the scope considerably.

 I agree that there may be complexities, but I'm not sure this is really
 the issue..

In essence, if you want MVCC catalog access without AEL, you're in for
a rough ride. I'm not as experienced with pg's core as you, so you
tell me, but I imagine it will be the case.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-02 Thread Peter Geoghegan
On Thu, Oct 2, 2014 at 1:10 PM, Bruce Momjian br...@momjian.us wrote:
 I think if we use the MERGE command for this feature we would need to
 use a non-standard keyword to specify that we want OLTP/UPSERT
 functionality.  That would allow us to mostly use the MERGE standard
 syntax without having surprises about non-standard behavior.  I am
 thinking of how CONCURRENTLY changes the behavior of some commands.

That would leave you without a real general syntax. It'd also make it
harder to have certain aspects of an UPSERT be explicit (there is no
conventional join involved here - everything goes through a unique
index). Adding the magic keyword would break certain other parts of the
statement, so you'd need exact rules for what worked where. I see no
advantage, and considerable disadvantages.

Note that I've documented a lot of this stuff here:

https://wiki.postgresql.org/wiki/UPSERT

Mapping the join thing onto which unique index you want to make the
UPSERT target is very messy. There are a lot of corner cases. It's
quite ticklish.
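
A tiny example of the ambiguity (hypothetical table):

CREATE TABLE users (
    id    integer UNIQUE,
    email text    UNIQUE,
    name  text
);

A MERGE-style join condition alone doesn't say whether a conflict on id
or a conflict on email is the one the upsert should arbitrate on.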

Please add to it if you think we've missed something.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TAP test breakage on MacOS X

2014-10-02 Thread Peter Eisentraut
On 10/2/14 3:19 PM, Robert Haas wrote:
 1..2
 ok 1 - initdb with invalid option nonzero exit code
 ok 2 - initdb with invalid option prints error message
 # Looks like your test exited with 256 just after 2.
 not ok 3 - initdb options handling
 
 #   Failed test 'initdb options handling'
 #   at /opt/local/lib/perl5/5.12.5/Test/Builder.pm line 229.

Is this repeatable?  Bisectable?  Has it ever worked?  Have you tried a
clean build?  Did you upgrade something in your operating system?

It appears to work everywhere else.

If none of this gets us closer to an answer, I can try to produce a
patch that produces more details for such failures.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Peter Geoghegan
On Thu, Oct 2, 2014 at 2:04 PM, Claudio Freire klaussfre...@gmail.com wrote:
 I've done that manually (throw the DDL, and cancel if it takes more
 than a couple of seconds) on modest but relatively busy servers with
 quite some success.


Fair enough, but that isn't the same as NOWAIT. It's something we'd
have a hard time coming up with a general-purpose timeout for.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TAP test breakage on MacOS X

2014-10-02 Thread Andres Freund
On 2014-10-02 17:09:43 -0400, Peter Eisentraut wrote:
 On 10/2/14 3:19 PM, Robert Haas wrote:
  1..2
  ok 1 - initdb with invalid option nonzero exit code
  ok 2 - initdb with invalid option prints error message
  # Looks like your test exited with 256 just after 2.
  not ok 3 - initdb options handling
  
  #   Failed test 'initdb options handling'
  #   at /opt/local/lib/perl5/5.12.5/Test/Builder.pm line 229.
 
 Is this repeatable?  Bisectable?  Has it ever worked?  Have you tried a
 clean build?  Did you upgrade something in your operating system?
 
 It appears to work everywhere else.
 
 If none of this gets us closer to an answer, I can try to produce a
 patch that produces more details for such failures.

FWIW, the current amount of detail on errors is clearly insufficient.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Stephen Frost
* Peter Geoghegan (p...@heroku.com) wrote:
 On Thu, Oct 2, 2014 at 12:40 PM, Stephen Frost sfr...@snowman.net wrote:
  The downside of the 'explain' approach is that the script then has to be
  modified to put 'explain' in front of everything and then you have to go
  through each statement and consider it.  Having a 'dry-run' transaction
  type which then produces a report at the end feels like it'd be both
  easier to assess the overall implications, and less error-prone as you
  don't have to prefex every statement with 'explain'.  It might even be
  possible to have the local view of post-alter statements be available
  inside of this 'dry-run' option- that is, if you add a column in the
  transaction then the column exists to the following commands, so it
  doesn't just error out.  Having 'explain whatever' wouldn't give you
  that and so you really wouldn't be able to have whole scripts run by
  just pre-pending each command with 'explain'.
 
 It's kind of tricky to implement a patch to figure this out ahead of
 time. Some of the actual lock acquisitions are well hidden, in terms
 of how the code is structured. In others cases, it may not even be
 possible to determine ahead of time exactly what locks will be taken.

I was thinking this would be a new kind of transaction and we'd have to
teach parts of the system about it- yes, that's pretty invasive, but
it's at least one approach to consider.

 As Harold mentioned, another idea along the same lines would be to
 decorate DDL with a NOWAIT no locking assertion and/or no rewrite
 assertion.  Basically, if this DDL (or perhaps any DDL, if this is
 implemented as a GUC instead) necessitates a table rewrite (and
 requires an AccessExclusiveLock), throw an error. That's the case that
 most people care about.

The problem I see with this approach is outlined above.  I agree that it
may be independently valuable, but I don't see it as being a solution to
the issue.

 This may not even be good enough, though. Consider:
 
 Session 1 is a long running transaction. Maybe it's a spurious
 idle-in-transaction situation, but it could also be totally
 reasonable. It holds an AccessShareLock on some relation, as long
 running transactions are inclined to do.
 
 Session 2 is our migration. It needs an AccessExclusiveLock to ALTER
 TABLE on the same relation (or whatever). But it doesn't need a
 rewrite, which is good. It comes along and attempts to acquire the
 lock, blocking on session 1.
 
 Session 3 is an innocent bystander. It goes to query the same table in
 an ordinary, routine way - a SELECT statement. Even though session 2's
 lock is not granted yet, session 3 is not at liberty to skip the queue
 and get its own AccessShareLock. The effect is about the same as if
 session 2 did need to hold an AccessExclusiveLock for ages: read
 queries block for a long time. And yet, in theory session 2's impact
 on production should not be minimal, if we consider something like
 EXPLAIN output.
 
 Why is NOWAIT only supported for SET TABLESPACE? I guess it's just a
 particularly bad case. NOWAIT might be the wrong thing for DDL
 generally.

I agree that this is a concern, but this feels to me like a next step
on top of the "assess the locks required" transaction type which I am
trying to outline.  Specifically, having a way to take the report of
what locks are going to be required and then actually attempt to acquire
them all (or fail if any can't be granted immediately) would be a
natural next step and a way to start off the actual migration script-
either all get acquired and the script runs to completion, or a lock
isn't granted and the whole thing fails immediately without anything
actually being done.

Thanks,

Stephen




Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Andres Freund
On 2014-10-02 17:03:59 -0400, Stephen Frost wrote:
 * Claudio Freire (klaussfre...@gmail.com) wrote:
  On Thu, Oct 2, 2014 at 4:40 PM, Stephen Frost sfr...@snowman.net wrote:
   The downside of the 'explain' approach is that the script then has to be
   modified to put 'explain' in front of everything and then you have to go
   through each statement and consider it.  Having a 'dry-run' transaction
   type which then produces a report at the end feels like it'd be both
   easier to assess the overall implications, and less error-prone as you
   don't have to prefex every statement with 'explain'.  It might even be
   possible to have the local view of post-alter statements be available
   inside of this 'dry-run' option- that is, if you add a column in the
   transaction then the column exists to the following commands, so it
   doesn't just error out.  Having 'explain whatever' wouldn't give you
   that and so you really wouldn't be able to have whole scripts run by
   just pre-pending each command with 'explain'.
  
  That sounds extremely complex. You'd have to implement the fake
  columns, foreign keys, indexes, etc on most execution nodes, the
  planner, and even system views.
 
 Eh?  We have MVCC catalog access.

So you want to modify the catalog without actually doing the
corresponding actions? That'll be awfully invasive, with changes all
over the backend. We'll need to remove error checks (like for the
existence of relfilenodes), remove rewriting, and such.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Andres Freund
On 2014-10-02 13:49:36 -0300, Claudio Freire wrote:
 EXPLAIN ALTER TABLE ?

I don't think that'll work - there's already EXPLAIN for some CREATE. At
least CREATE TABLE ... AS, CREATE VIEW ... AS and SELECT INTO.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Stephen Frost
* Claudio Freire (klaussfre...@gmail.com) wrote:
 On Thu, Oct 2, 2014 at 6:03 PM, Stephen Frost sfr...@snowman.net wrote:
  That sounds extremely complex. You'd have to implement the fake
  columns, foreign keys, indexes, etc on most execution nodes, the
  planner, and even system views.
 
  Eh?  We have MVCC catalog access.
 
 And that needs locks, especially if you modify the underlying filesystem 
 layout.

And we wouldn't be doing that, certainly.  It's a dry-run.

  IMO, dry-run per se, is a BEGIN; stuff; ROLLBACK. But that still needs
  locks. I don't think you can simulate the side effects without locks,
 
  Why?  If you know the transaction is going to roll back and you only add
  entries to the catalog which aren't visible to any other transactions
  than your own, and you make sure that nothing you do actually writes
  data out which is visible to other transactions..
 
 But that's not the scope. If you want a dry-run of table-rewriting
 DDL, or DDL interspersed with DML like:
 
 alter table blargh add foo integer;
 update blargh set foo = coalesce(bar, baz);
 
 You really cannot hope not to have to write data. The above is also
 the case with defaulted columns btw.

The point is to not write anything which is visible to other
transactions, which means we'd have to put DML into some different
'mode' which doesn't actually write where other processes might be
looking.  I'm not saying it's trivial to do, but I don't think it's
impossible either.  We might also be able to simply get away with
short-circuiting them and not actually writing anything (and the same
for reading data..).  What would probably be useful is to review actual
migration scripts and see if this would really work.  I know they'd work
for at least a subset of the migration scripts that I've dealt with
before, though certainly not all of them.

  so getting the local view of changes will be extremely difficult
  unless you limit the scope considerably.
 
  I agree that there may be complexities, but I'm not sure this is really
  the issue..
 
 In essence, if you want MVCC catalog access without AEL, you're in for
 a rough ride. I'm not as experienced with pg's core as you, so you
 tell me, but I imagine it will be the case.

It's not clear to me what you're getting at as the 'rough' part,
exactly..

Thanks,

Stephen




Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2014-10-02 17:03:59 -0400, Stephen Frost wrote:
   That sounds extremely complex. You'd have to implement the fake
   columns, foreign keys, indexes, etc on most execution nodes, the
   planner, and even system views.
  
  Eh?  We have MVCC catalog access.
 
 So you want to modify the catalog without actually doing the
 corresponding actions? That'll be heck of invasive. With changes all
 over the backend. We'll need to remove error checks (like for the
 existance of relfilenodes), remove rewriting, and such.

Yeah, I was getting at it being rather invasive earlier.  It really
depends on exactly what we'd support in this mode, which would depend on
just what would be invasive and what wouldn't, I expect.  I dislike the
idea of not being able to actually run a real migration script though as
anything else opens the very real possibility that the real script and
the 'explain' script don't do the same thing, making this capability not
nearly as useful..

Thanks,

Stephen




Re: [HACKERS] Assertion failure in syncrep.c

2014-10-02 Thread Simon Riggs
On 18 September 2014 07:32, Pavan Deolasee pavan.deola...@gmail.com wrote:

 564 /*
 565  * Set state to complete; see SyncRepWaitForLSN() for discussion
 of
 566  * the various states.
 567  */
 568 thisproc-syncRepState = SYNC_REP_WAIT_COMPLETE;
 569
 570 /*
 571  * Remove thisproc from queue.
 572  */
 573 SHMQueueDelete((thisproc-syncRepLinks));

Yes, looks like a bugette to me also.

Unlikely to occur except when no network is involved, and even more
luckily no effect at all when Assert is not enabled. So not a priority
fix.

But fix looks trivial, just to switch around the two statements above.

I'll backpatch tomorrow.

Thanks

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-02 Thread Bruce Momjian
On Thu, Oct  2, 2014 at 02:08:30PM -0700, Peter Geoghegan wrote:
 On Thu, Oct 2, 2014 at 1:10 PM, Bruce Momjian br...@momjian.us wrote:
  I think if we use the MERGE command for this feature we would need to
  use a non-standard keyword to specify that we want OLTP/UPSERT
  functionality.  That would allow us to mostly use the MERGE standard
  syntax without having surprises about non-standard behavior.  I am
  thinking of how CONCURRENTLY changes the behavior of some commands.
 
 That would leave you without a real general syntax. It'd also make
 having certain aspects of an UPSERT more explicit be a harder goal
 (there is no conventional join involved here - everything goes through
 a unique index). Adding the magic keyword would break certain other
 parts of the statement, so you'd have exact rules for what worked
 where. I see no advantage, and considerable disadvantages.
 
 Note that I've documented a lot of this stuff here:
 
 https://wiki.postgresql.org/wiki/UPSERT
 
 Mapping the join thing onto which unique index you want to make the
 UPSERT target is very messy. There are a lot of corner cases. It's
 quite ticklish.
 
 Please add to it if you think we've missed something.

OK, it is was just an idea I wanted to point out, and if it doesn't
work, it more clearly cements that we need UPSERT _and_ MERGE.

Josh was pointing out that we don't want to surprise our users, so I
suggested an additional keyword, which addresses his objections, but as
you said, if that standard MERGE syntax doesn't give us what we want,
then that is the fatal objection to using only MERGE.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Claudio Freire
On Thu, Oct 2, 2014 at 6:19 PM, Stephen Frost sfr...@snowman.net wrote:
 * Claudio Freire (klaussfre...@gmail.com) wrote:
 On Thu, Oct 2, 2014 at 6:03 PM, Stephen Frost sfr...@snowman.net wrote:
  That sounds extremely complex. You'd have to implement the fake
  columns, foreign keys, indexes, etc on most execution nodes, the
  planner, and even system views.
 
  Eh?  We have MVCC catalog access.

 And that needs locks, especially if you modify the underlying filesystem 
 layout.

 And we wouldn't be doing that, certainly.  It's a dry-run.

...

 (...) We might also be able to simply get away with
 short-circuiting them and not actually writing anything (and the same
 for reading data..).  What would probably be useful is to review actual
 migration scripts and see if this would really work.  I know they'd work
 for at least a subset of the migration scripts that I've dealt with
 before, and also not all of them.

I believe, for it to be reasonably unintrusive, you'd have to abort at
the first need to actually read/write data.

Most of my migration scripts, especially the ones that would benefit
from this, require some of that, but that's just my personal common
practice, not a general case.

  IMO, dry-run per se, is a BEGIN; stuff; ROLLBACK. But that still needs
  locks. I don't think you can simulate the side effects without locks,
 
  Why?  If you know the transaction is going to roll back and you only add
  entries to the catalog which aren't visible to any other transactions
  than your own, and you make sure that nothing you do actually writes
  data out which is visible to other transactions..

 But that's not the scope. If you want a dry-run of table-rewriting
 DDL, or DDL interspersed with DML like:

 alter table blargh add foo integer;
 update blargh set foo = coalesce(bar, baz);

 You really cannot hope not to have to write data. The above is also
 the case with defaulted columns btw.

 The point is to not write anything which is visible to other
 transactions, which means we'd have to put DML into some different
 'mode' which doesn't actually write where other processes might be
 looking.  I'm not saying it's trivial to do, but I don't think it's
 impossible either.  (...)

No, I don't think it's impossible either. Just very, very
time-consuming. Both in developing the patch and in its eventual
maintenance.

TBH, a separate read-only transaction like explain alter would also be
quite difficult to keep in sync with actual alter logic, unless it's
handled by the same code (unlikely in that form).

  so getting the local view of changes will be extremely difficult
  unless you limit the scope considerably.
 
  I agree that there may be complexities, but I'm not sure this is really
  the issue..

 In essence, if you want MVCC catalog access without AEL, you're in for
 a rough ride. I'm not as experienced with pg's core as you, so you
 tell me, but I imagine it will be the case.

 It's not clear to me what you're getting at as the 'rough' part,
 exactly..

A lot of work, touching most of the codebase, and hard to maintain in the end.

Unless you limit the scope, as you said, just touching the catalog,
not data, could be doable. But it would act as an implicit
NOREWRITE.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Dimitri Fontaine
Alvaro Herrera alvhe...@2ndquadrant.com writes:
   - will the table have to be rewritten? the indexes?

 Please give my DDL deparsing patch a look.  There is a portion there
 about deparsing ALTER TABLE specifically; what it does is save a list of
 subcommands, and for each of them we either report the OID of the object
 affected (for example in ADD CONSTRAINT), or a column number (for ALTER
 COLUMN RENAME, say).  It sounds like you would like to have some extra
 details returned: for instance the does the whole of it require a table
 rewrite bit.  It sounds like it can be trivially returned in the JSON

Some years ago when working on the Event Trigger framework we did
mention providing some interesting events, such as a TableRewrite Event.

In between what you're saying here and what Harold and Peter Geoghegan
are mentioning (basically that dealing with table rewrites is 90% of
the need for them), it could be that the best way to have at it would be
to add that Event in the Event Trigger mechanism.

We could also add an AccessExclusiveLock Event that would fire just
before actually taking the lock, allowing people to RAISE EXCEPTION in
that case, or to maybe just do the LOCK … NOWAIT themselves in the
trigger.

For the locking parts, best would be to do the LOCK … NOWAIT dance for
all the tables touched by the DDL migration script. The Event Trigger
approach will not solve that, unfortunately.
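
Roughly (table names invented):

BEGIN;
LOCK TABLE emp, emp_audit IN ACCESS EXCLUSIVE MODE NOWAIT;  -- fail fast on any conflict
ALTER TABLE emp ADD COLUMN foo2 jsonb;
ALTER TABLE emp_audit ADD COLUMN foo2 jsonb;
COMMIT;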

Regards,
-- 
Dimitri Fontaine06 63 07 10 78
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Jan Wieck

On 10/02/2014 01:15 PM, Joe Conway wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 10/02/2014 11:30 AM, Dimitri Fontaine wrote:

Questions:

1. Do you agree that a systematic way to report what a DDL command
(or script, or transaction) is going to do on your production
database is a feature we should provide to our growing user base?


+1

I really like the idea and would find it useful/time-saving


2. What do you think such a feature should look like?


Elsewhere on this thread EXPLAIN was suggested. That makes a certain
amount of sense.

Maybe something like EXPLAIN IMPACT [...]


3. Does it make sense to support the whole set of DDL commands from
the get go (or ever) when most of them are only taking locks in
their own pg_catalog entry anyway?


Yes, I think it should cover all commands that can have an
availability impact.


In principle I agree with the sentiment. However, that full coverage is 
a nice goal, seldom achieved.


The real question is at what level of information, returned to the user, 
does this feature become user friendly?


It is one thing to provide information of the kind of

TAKE ACCESS EXCLUSIVE LOCK ON TABLE foo
TEST EVERY ROW IN TABLE foo FOR FK (a, b) IN bar (id1, id2)

That information is useful, but only to an experienced DBA who knows 
their schema and data to a certain degree. The majority of users, I 
fear, will not be able to even remotely guesstimate if that will need 
seconds or hours.


There needs to be more detailed information for those cases and I believe 
that tackling them one at a time in depth will lead to more useful 
results than trying to cover a lot but shallowly.



My $.02
Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for updating src/timezone

2014-10-02 Thread Tom Lane
John Cochran j69coch...@gmail.com writes:
 As it is, I've finished checking the differences between the postgres and
 IANA code for zic.c after editing both to eliminate non-functional style
 differences such as indentation, function prototypes, comparing strchr
 results against NULL or 0, etc. It looks like the only differences from the
 stock code are support for the -P option implemented by Tom Lane and the
 substitution of the postgres code for getopt instead of the unix default as
 well as a few minor changes in some defaults and minor modifications to
 deal with Windows. Overall, rather small and trivial.

 Additionally, I discovered a little piece of low hanging fruit. The
 Makefile for the timezone code links together zic.o ialloc.o scheck.o and
 localtime.o in order to make zic.  Additionally, zic.c has a stub
 implementation of pg_open_tzfile() in order to resolve a linkage issue with
 the localtime.o module for zic.
 Well, as it turns out, localtime.o doesn't supply ANY symbols that zic
 needs and therefore can be omitted entirely from the list of object files
 comprising zic. Which in turn means that the stub implementation of
 pg_open_tzfile can be removed from the postgres version of zic.c.  I'm not
 bothering to submit a patch involving this since that patch will be quite
 short lived given my objective to bring the code up to date with the 2014e
 version of the IANA code. But I am submitting a bug report to IANA on the
 Makefile since the unneeded linkage with localtime.o is still in the 2014e
 code on their site.

John, have you made any further progress on this since July?

The urgency of updating our timezone code has risen quite a bit for me,
because while testing an update of the data files to tzdata2014h I became
aware that the -P option is failing to print a noticeable number of
zone abbreviations that clearly exist in the data files.  Since the -P
hack itself is so simple, it's hard to come to any other conclusion than
that the rest of the timezone code is buggy --- presumably because it's
four years behind what IANA is targeting with the data files.

I spent a little bit of time looking at just applying the diffs between
tzcode2010c and tzcode2014h to our code, but the diffs are large enough
that that seems both tedious and quite error-prone.  I think we'd be
better advised to push forward with the idea of trying to absorb
more-or-less-unmodified versions of the timezone code files.  (I'm
particularly disillusioned with the idea that we'll keep on pgindent'ing
that code --- the effort-to-reward ratio even to keep the comments
readable just ain't very good.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NEXT VALUE FOR sequence

2014-10-02 Thread Thomas Munro
On 3 October 2014 00:01, Thomas Munro mu...@ip9.org wrote:
 On 2 October 2014 14:48, Tom Lane t...@sss.pgh.pa.us wrote:
 Thomas Munro mu...@ip9.org writes:
 SQL:2003 introduced the function NEXT VALUE FOR sequence. Google
 tells me that at least DB2, SQL Server and a few niche databases
 understand it so far.  As far as I can tell there is no standardised
 equivalent of currval and setval (but I only have access to second
 hand information about the standard, like articles and the manuals of
 other products).

 Have you checked the archives about this?  My recollection is that one
 reason it's not in there (aside from having to reserve NEXT) is that
 the standard-mandated semantics are not the same as nextval().

 Right, I found the problem: "If there are multiple instances of next value
 expressions specifying the same sequence generator within a single
 SQL-statement, all those instances return the same value for a
 given row processed by that SQL-statement."  This was discussed in a thread
 from 2002 [1].

 So the first step would be to add a standard-conforming function into which
 to transform the standard's syntax.

 I found the text in the 20nn draft specification and it didn't seem 
 immediately
 clear what 'statement' should mean, for example what if your statement calls
 pl/pgsql which contains further statements, and what if triggers, default
 expressions, etc are invoked?  I suppose one approach would be to use command
 IDs as the scope.  Do you think the following change would make sense?

 In struct SeqTableData (from sequence.c), add a member last_command_id.
 When you call the new function, let's say nextval_for_command(regclass),
 if last_command_id matches GetCommandId() then it behaves like currval_oid
 and returns last, otherwise it behaves like nextval_oid, and updates
 last_command_id to the current command ID.

Actually scratch that, it's not about statements, it's about "rows processed
by that SQL-statement".  I will think about how that could be interpreted
and implemented...
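
To make the difference concrete, here is a sketch (the table and sequence
names are made up, and NEXT VALUE FOR is of course not implemented yet):

  CREATE SEQUENCE s;
  CREATE TABLE t (a bigint, b bigint);

  -- With nextval(), the two calls may return different values in one row:
  INSERT INTO t SELECT nextval('s'), nextval('s') FROM generate_series(1, 3);

  -- Under the spec's wording, both references below would have to return
  -- the same value for any given row, i.e. a = b in every inserted row:
  -- INSERT INTO t SELECT NEXT VALUE FOR s, NEXT VALUE FOR s
  --   FROM generate_series(1, 3);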

Best regards,
Thomas Munro


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE IF NOT EXISTS INDEX

2014-10-02 Thread Marti Raudsepp
On Wed, Oct 1, 2014 at 2:42 PM, Fabrízio de Royes Mello
fabriziome...@gmail.com wrote:
 So, what's the correct/best grammar?
 CREATE [ IF NOT EXISTS ] [ UNIQUE ] INDEX index_name
 or
 CREATE [ UNIQUE ] INDEX [ IF NOT EXISTS ] index_name

I've elected myself as the reviewer for this patch. Here are some
preliminary comments...

I agree with José. The 2nd is more consistent given the other syntaxes:
  CREATE { TABLE | SCHEMA | EXTENSION | ... } IF NOT EXISTS name ...
It's also compatible with SQLite's grammar:
https://www.sqlite.org/lang_createindex.html

Do we want to enforce an order on the keywords or allow both?
  CREATE INDEX IF NOT EXISTS CONCURRENTLY foo ...
  CREATE INDEX CONCURRENTLY IF NOT EXISTS foo ...

It's probably very rare to use both keywords at the same time, so I'd
prefer only the 2nd, unless someone else chimes in.
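
For the archives, with that preferred ordering the statements would look
like this (index and table names are made up, and this is of course pending
the patch):

  CREATE UNIQUE INDEX IF NOT EXISTS orders_ref_idx ON orders (ref);
  CREATE INDEX CONCURRENTLY IF NOT EXISTS orders_created_idx ON orders (created_at);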

Documentation: I would prefer if the explanation were consistent with
the description for ALTER TABLE/EXTENSION; just copy it and replace
"relation" with "index".

+ ereport(NOTICE,
+ (errcode(ERRCODE_DUPLICATE_TABLE),
+  errmsg("relation \"%s\" already exists, skipping",
+ indexRelationName)));

1. Clearly "relation" should be "index".
2. Use ERRCODE_DUPLICATE_OBJECT not TABLE

+ if (n->if_not_exists && n->idxname == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+  errmsg("IF NOT EXISTS requires that you name the index."),

I think ERRCODE_SYNTAX_ERROR makes more sense; it's something that we
decided we *don't want* to support.

- write_msg(NULL, "reading row-security enabled for table \"%s\"",
+ write_msg(NULL, "reading row-security enabled for table \"%s\"\n",

???

Regards,
Marti


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NEXT VALUE FOR sequence

2014-10-02 Thread Tom Lane
Thomas Munro mu...@ip9.org writes:
 On 2 October 2014 14:48, Tom Lane t...@sss.pgh.pa.us wrote:
 Have you checked the archives about this?  My recollection is that one
 reason it's not in there (aside from having to reserve NEXT) is that
 the standard-mandated semantics are not the same as nextval().

 Right, I found the problem: "If there are multiple instances of next value
 expressions specifying the same sequence generator within a single
 SQL-statement, all those instances return the same value for a
 given row processed by that SQL-statement."  This was discussed in a thread
 from 2002 [1].

Wow, it was that far back?  No wonder I didn't remember the details.

 I suppose one approach would be to use command
 IDs as the scope.

The spec clearly says one value per row, not one per statement; so
command ID is very definitely not the right thing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Joe Conway

On 10/02/2014 06:43 PM, Jan Wieck wrote:
 The real question is at what level of information, returned to the
 user, does this feature become user friendly?
 
 It is one thing to provide information of the kind of
 
 TAKE ACCESS EXCLUSIVE LOCK ON TABLE foo TEST EVERY ROW IN TABLE
 foo FOR FK (a, b) IN bar (id1, id2)
 
 That information is useful, but only to an experienced DBA who
 knows their schema and data to a certain degree. The majority of
 users, I fear, will not be able to even remotely guesstimate if
 that will need seconds or hours.
 
 There needs to be more detail information for those cases and I
 believe that tackling them one at a time in depth will lead to more
 useful results than trying to cover a lot but shallow.

Perhaps.

This and other posts on this thread make me wonder if some kind of
extension to the stats collector and pg_stat_statements might be a good
place to start?

Joe

-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting,  24x7 Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixed xloginsert_locks for 9.4

2014-10-02 Thread Peter Geoghegan
On Thu, Oct 2, 2014 at 5:08 PM, Greg Smith
greg.sm...@crunchydatasolutions.com wrote:
 When 9.4 is already giving a more than 100% gain on this targeted test case,
 I can't see that chasing after maybe an extra 10% is worth having yet
 another GUC around.  Especially when it will probably take multiple tuning
 steps before you're done anyway; we don't really know the rest of them yet;
 and when we do, we probably won't need a GUC to cope with them in the end
 anyway.

Agreed. I think that prior to 9.4, the logging performance of Postgres
was very competitive when compared to other systems. At this stage,
it's probably extremely fast by any standard. Amit's work on only
WAL-logging the modified portion of UPDATEs helps here too.

I tend to believe that the next big round of performance gains can be
had by working on the buffer manager, and B-Tree indexes. At some
point we should work on prefix compression within B-Tree leaf pages.
We should also work on adding abbreviated keys to B-Tree internal
pages. Doing so should almost remove the benefit of using the C
locale, because most comparisons needed for index scans can use
comparisons implemented as nothing more than a memcmp() (note that
internal pages have values that are naturally heterogeneous, so this
will work well).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-10-02 Thread Stephen Frost
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
 Alvaro Herrera wrote:
  Basically, if you are on 9.3.5 or earlier any per-table options for
  autovacuum cost delay will misbehave (meaning: any such table will be
  processed with settings flattened according to balancing of the standard
  options, _not_ the configured ones).  If you are on 9.3.6 or newer they
  will behave as described in the docs.
 
 Another thing to note is that if you have configured a table to have
 cost_limit *less* than the default (say 150 instead of the default 200),
 the balance system will again break that and process the table at 200
 instead; in other words, the balancing system has completely broken the
 ability to tweak the cost system for individual tables in autovacuum.

That's certainly pretty ugly.

 With the v5 patch, the example tables above will be vacuumed at exactly
 5000 and 150 instead.  The more complex patch I produced earlier would
 have them vacuumed at something like 4900 and 100 instead, so you
 wouldn't exceed the total of 5000.  I think there is some value to that
 idea, but it seems the complexity of managing this is too high.

Agreed.

 I am rather surprised that nobody has reported this problem before.  I
 am now of the mind that this is clearly a bug that should be fixed all
 the way back.

I'm coming around to that also; however, should we worry about users who
set per-table settings and then simply forgot about them?  I suppose
that won't matter too much unless the table is really active, and if it
is, they've probably already set it to zero.
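
For reference, the kind of per-table tweak being discussed is just a storage
parameter, e.g. (table names made up, values echoing the examples above):

  ALTER TABLE small_but_hot SET (autovacuum_vacuum_cost_limit = 150);
  ALTER TABLE big_append_only SET (autovacuum_vacuum_cost_limit = 5000);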

Thanks,

Stephen




Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements

2014-10-02 Thread Marti Raudsepp
On Tue, Aug 26, 2014 at 4:20 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 04/14/2014 10:31 PM, Fabrízio de Royes Mello wrote:
 The attached patch contains CINE for sequences.

 I just strip this code from the patch rejected before.

 Committed with minor changes

Hmm, the CommitFest app lists Marko Tiikkaja as the reviewer, but I
can't find his review anywhere...

The documentation claims:
CREATE [ IF NOT EXISTS ] SEQUENCE name
But grammar implements it the other way around:
CREATE SEQUENCE IF NOT EXISTS name;

Regards,
Marti


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE IF NOT EXISTS INDEX

2014-10-02 Thread Marti Raudsepp
On Fri, Oct 3, 2014 at 2:15 AM, Marti Raudsepp ma...@juffo.org wrote:
 + ereport(NOTICE,
 + (errcode(ERRCODE_DUPLICATE_TABLE),
 +  errmsg("relation \"%s\" already exists, skipping",
 + indexRelationName)));

 1. Clearly "relation" should be "index".
 2. Use ERRCODE_DUPLICATE_OBJECT not TABLE

My bad, this code is OK. The current code already uses "relation" and
TABLE elsewhere because indexes share the same namespace with tables.

+ /*
+  * Throw an exception when IF NOT EXISTS is used without a named
+  * index
+  */

I'd say "without an index name". And the line goes beyond 80 characters.

I would also move this check to after all the attributes have been
assigned, rather than splitting the assignments in half.

Regards,
Marti


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TAP test breakage on MacOS X

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 5:09 PM, Peter Eisentraut pete...@gmx.net wrote:
 On 10/2/14 3:19 PM, Robert Haas wrote:
 1..2
 ok 1 - initdb with invalid option nonzero exit code
 ok 2 - initdb with invalid option prints error message
 # Looks like your test exited with 256 just after 2.
 not ok 3 - initdb options handling

 #   Failed test 'initdb options handling'
 #   at /opt/local/lib/perl5/5.12.5/Test/Builder.pm line 229.

 Is this repeatable?  Bisectable?  Has it ever worked?  Have you tried a
 clean build?  Did you upgrade something in your operating system?

I don't think it's ever worked.  I just didn't get around to reporting
it before now.

 If none of this gets us closer to an answer, I can try to produce a
 patch that produces more details for such failures.

A test that fails for no reason that can be gleaned from the output is
not an improvement over not having a test at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_receivexlog and replication slots

2014-10-02 Thread Michael Paquier
On Thu, Oct 2, 2014 at 12:44 AM, Andres Freund and...@2ndquadrant.com
wrote:

 I pushed the first part.

Thanks. Attached is a rebased version of patch 2, implementing the actual
feature. One thing I noticed with more testing is that if --create is used
and the destination folder does not exist, pg_receivexlog would create the
slot and then exit with an error. This does not look user-friendly, so I
changed the logic a bit to check for the destination folder before creating
any slot. This results in a bit of refactoring, but this way the behavior is
more intuitive.
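
In other words, with this in place one no longer has to create the slot by
hand first with something like (slot name made up):

  SELECT pg_create_physical_replication_slot('archive_slot');

before pointing pg_receivexlog at it with --slot.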
Regards,
-- 
Michael
From e3ee4b918b8122d414c03207fdd50f2472dc292e Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 1 Sep 2014 20:53:45 +0900
Subject: [PATCH] Support for replslot creation and drop in pg_receivexlog

Using the new actions --create and --drop that are similarly present
in pg_recvlogical, a user can respectively create and drop a replication
slot that can be used afterwards when fetching WALs.
---
 doc/src/sgml/ref/pg_receivexlog.sgml   |  29 +++
 src/bin/pg_basebackup/pg_receivexlog.c | 154 +
 2 files changed, 168 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/ref/pg_receivexlog.sgml b/doc/src/sgml/ref/pg_receivexlog.sgml
index 5916b8f..69aa0af 100644
--- a/doc/src/sgml/ref/pg_receivexlog.sgml
+++ b/doc/src/sgml/ref/pg_receivexlog.sgml
@@ -72,6 +72,35 @@ PostgreSQL documentation
   <title>Options</title>
 
   <para>
+    <application>pg_receivexlog</application> can run in one of two following
+    modes, which control physical replication slot:
+
+    <variablelist>
+
+     <varlistentry>
+      <term><option>--create</option></term>
+      <listitem>
+       <para>
+        Create a new physical replication slot with the name specified in
+        <option>--slot</option>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>--drop</option></term>
+      <listitem>
+       <para>
+        Drop the replication slot with the name specified in
+        <option>--slot</option>, then exit.
+       </para>
+      </listitem>
+     </varlistentry>
+    </variablelist>
+
+   </para>
+
+   <para>
 The following command-line options control the location and format of the
 output.
 
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index 171cf43..8c4f752 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -38,11 +38,15 @@ static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
 static int	fsync_interval = 0; /* 0 = default */
 static volatile bool time_to_abort = false;
+static bool do_create_slot = false;
+static bool do_drop_slot = false;
 
 
 static void usage(void);
+static DIR* get_destination_dir(char *dest_folder);
+static void close_destination_dir(DIR *dest_dir, char *dest_folder);
 static XLogRecPtr FindStreamingStart(uint32 *tli);
-static void StreamLog();
+static void StreamLog(void);
 static bool stop_streaming(XLogRecPtr segendpos, uint32 timeline,
 			   bool segment_finished);
 
@@ -78,6 +82,9 @@ usage(void)
 	printf(_("  -w, --no-password  never prompt for password\n"));
 	printf(_("  -W, --password force password prompt (should happen automatically)\n"));
 	printf(_("  -S, --slot=SLOTNAME    replication slot to use\n"));
+	printf(_("\nOptional actions:\n"));
+	printf(_("  --create   create a new replication slot (for the slot's name see --slot)\n"));
+	printf(_("  --drop drop the replication slot (for the slot's name see --slot)\n"));
 	printf(_("\nReport bugs to pgsql-b...@postgresql.org.\n"));
 }
 
@@ -118,6 +125,44 @@ stop_streaming(XLogRecPtr xlogpos, uint32 timeline, bool segment_finished)
 	return false;
 }
 
+
+/*
+ * Get destination directory.
+ */
+static DIR*
+get_destination_dir(char *dest_folder)
+{
+	DIR *dir;
+
+	Assert(dest_folder != NULL);
+	dir = opendir(dest_folder);
+	if (dir == NULL)
+	{
+		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
+				progname, basedir, strerror(errno));
+		disconnect_and_exit(1);
+	}
+
+	return dir;
+}
+
+
+/*
+ * Close existing directory.
+ */
+static void
+close_destination_dir(DIR *dest_dir, char *dest_folder)
+{
+	Assert(dest_dir != NULL  dest_folder != NULL);
+	if (closedir(dest_dir))
+	{
+		fprintf(stderr, _("%s: could not close directory \"%s\": %s\n"),
+				progname, dest_folder, strerror(errno));
+		disconnect_and_exit(1);
+	}
+}
+
+
 /*
  * Determine starting location for streaming, based on any existing xlog
  * segments in the directory. We start at the end of the last one that is
@@ -134,13 +179,7 @@ FindStreamingStart(uint32 *tli)
 	uint32		high_tli = 0;
 	bool		high_ispartial = false;
 
-	dir = opendir(basedir);
-	if (dir == NULL)
-	{
-		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
-				progname, basedir, strerror(errno));
-		disconnect_and_exit(1);
-	}
+	dir = get_destination_dir(basedir);
 
 	while (errno = 0, (dirent = 

[HACKERS] GiST splitting on empty pages

2014-10-02 Thread Andrew Gierth
This is from Bug #11555, which is still in moderation as I type this
(analysis was done via IRC).

The GiST insertion code appears to have no length checks at all on the
inserted entry. index_form_tuple checks for length <= 8191, with the
default blocksize, but obviously a tuple less than 8191 bytes may
still not fit on the page due to page header info etc.

However, the gist insertion code assumes that if the tuple doesn't fit,
then it must have to split the page, with no check whether the page is
already empty.

So this crashes with infinite recursion in gistSplit (which also lacks
a check_stack_depth() call):

create extension pg_trgm;

create table t1 (a text);
create index on t1 using gist (a gist_trgm_ops);

insert into t1 values (
 'vvsehbxxlezgwtyvbgdyburmhxmuzbwvwoaepxbbbaiyuguwnxvprmbzkotkqqfnegruds'
 'wftedolykzvfonsqndehouxuibazwdrtjlynzjlkihqxvjnimrpbmnvupvtlzlejxdwwmh'
 'hvpxtkggstyivlvqgkmawmlbvjerfrzmnokgyrnrllagwwxdgjddwofrxjidbiqowbvusi'
 'mdumkrihuprxsnmyekhnojvexsftmybzcwlmntuijlfcyracciqqrmuoeairzkqbgcouvi'
 'cfthhszhvbplshxmwcnetnokovmdpimrnfzuxzcsaseszcfvetaxgoivjuzzclqprqkopn'
 'hqgmjgoocsicqpqylatkzvvqlmhwbwjjmpvvwkkyctatirstsldsgzismqonmxxzntvkdf'
 'ifzizharbsdfkjetcrjqfwocvcvqmywuevvevyjgiozkfialfpnarjqdinymibqlem'
 'qakzgtofeuoeftutulclpkynxgoostaitkizewfirunxnhqhttsiervbxkpqdqyxbhxfdc'
 'nvwbskiirbckkgbfizqypuoorpvovzqiunjnxswpuyaefbkobbmrvbgmrbbmbsvwffjcxf'
 'ssesxjtiyvjkmemsrdusqvklspqbsohkhlcevwmtrveeaqwrurjknuwfkngcbnnjzpnvma'
 'odvsiwjfnewxpjslocyveajsjjhxeuxsxtlgqvldzhbortagvybazlsjuagyueqsycyoxj'
 'swrtljnlpikrjjccswczuxelpdnorlyjhpdszqdozngjxilqfoqalumaxapplnzscclctp'
 'rtdxdagorlchmocypayepjrpcusowldnfgkihrxzcagoojndjmizwzoyugmqsqeyxpgege'
 'ejelytulxeyfdufsszzfqrvupskwxrbbafnyzhkwlicpchivhhaxywsopdlnumpusctrje'
 'ovmqlpytlfamdziwnxzyltlaodciummihzzsoxmadmmgluczscxdwiekmgsgsfpaeostme'
 'tprfwcazbtbzwyibiuhbbahqalftfryyhpeeseduxftudcvmwdoxdvodgtxllvktkoxdta'
 'xrgqmjtiwqlknpfctmwqyhliawxyzrywienvogdkwovkeanxmjnkrztsvrqviprquflimp'
 'tjeouiphfcrtnisgaoxrjfgbwahijuxddbsxkhfqjwjwfcdgrbxagdcdekmoekshmkfwsl'
 'mbivynyctpdrqkutnzdaohkgpwqvsihfkpajczlwoonfziynibnwjxczttumcbrnrswtri'
 'qgxelwmjjvlwruuutnoozqpregjbaajrhhvsdicndnkvhepbseprvfjzmsamtkearzsuiu'
 'ilhsgpwwqoafgvkpuwhujbenbwnuqvoygwrnlnjccjhiesyyogtyhymiuzclvrkbobpapy'
 'crhjalcykreepmdbvyaxkvpuxdwmdllfcspdesbpjguysyaowbmhwbcufyhiksonleqpws'
 'ffyzerxefufrcctexydegnxvajzrywjebiegzckfxqxzsqdpohuvusrvbrmanwepeivelg'
 'jiwhhoxlemszimraisrvterytwhpasvkarrhgptlclklyblhuccnhumbqtqrllcldutkkn'
 'vmyfyxhkecmhqubcvsvmkgxmsbgllqyhdxmbuulzwygmtipoakakqywjadvltusxrfymzk'
 'mwjsjcayqbirlzpiipmebfyucqabcampwvigxieoknfwnvfvlranxyiaoibringfjolgxq'
 'uhdaeqwjmhamvxldxzlzqunxawmmdjcyrgzvxvfjcfwydhbbhmbxhovhlhtoqwnicmeahj'
 'jkpgitojuvwvtekomvwfkncxvfkzfrjpcyvlskvmfrizwsoiokoyxqwsvhsazbpbalmsvh'
 'fbznavgoeuystwjpoexhfwjvxkgkdcridrwdncsrxrkqntgbydjdzszwcfgghyolqlodnh'
 'ukyfblyhnwkwajpzgsfnynlnybynfmuzxseyddfrapnaycafugsstdfsfefkqaknsplwsq'
 'ntgbufdukybcrugxnmbsxrsielxqiqhwjnxdtbydzzgqunnhzoawgsflecbmtjjcxggqhe'
 'tgeaxynkgmzgjgzordrtqkdznaftqnyktdkrcxcikbouiniarathkxgyxmsnzrytuikwfm'
 'eqotkxgtxxtrfeomclyvzymxrggcdmpicebmbifyfzpldexgqvbptnnlutnxfdfihhuipa'
 'hvaxgdbdkszliszvetpsrvvxddeymuytpyrvctzmlyytrxovreojzjhcnlazgzsvykrbdq'
 'nopmhgjwcbaqlaasdneemkdfgcpxdtoqhddoknlmomdzmdrprvtegxkmzajctytacxpmka'
 'zzncyzgqpxmjcsgmfgmojfndgpawckwbjjeijlzzjmilfpxkwkzfqmjxbjteuqfeaknjvm'
 'iezrqegnodynjpasmbbffwvlavwnfraowzfmdmaspygyograisgopcaqxwednerkexwijw'
 'azvhyjnpkwiqkxsloqhsuvwlfbjbjtykturrefhpcpfnnyybpftjaqvfsfhbygmraejekq'
 'umfzztyxuocoydftixzqzxwlpxpyczowmuwlnuiiilxgocaxaaozxklnialkaagmucyixh'
 'qgsnnhqnfqntpleaymbkxckdpfgnnduejnrwuikayytokyoilqtisdmhisvwwpafcscxan'
 'xylrnvpcebsxjlbvtkoogkegqhzsfzgdyrulnknslgqusrqmbebhpfofnnysnewlvqxjal'
 'cmrshkjxwkcxsrdhwquujhzftvwqexbgjtyqdioatqxliatfrnabvzhoueeybgflzecdmq'
 'dghbsqclvuyvvtudiohm'
);

Suggested fixes (probably all of these are appropriate):

1. gistSplit should have check_stack_depth()

2. gistSplit should probably refuse to split anything if called with
only one item (which is the value being inserted).

3. somewhere before reaching gistSplit it might make sense to check
explicitly (e.g. in gistFormTuple) whether the tuple will actually
fit on a page.

4. pg_trgm probably should do something more sensible with large leaf
items, but this is a peripheral issue since ultimately the gist core
must enforce these limits rather than rely on the opclass.

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TAP test breakage on MacOS X

2014-10-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 make check-world dies ingloriously for me, like this:

FWIW, it works fine for me on my Mac laptop, using the Perl 5.16.2 that
comes standard with OSX 10.9.5.  I did have to install IPC::Run from
CPAN though.

 #   Failed test 'initdb options handling'
 #   at /opt/local/lib/perl5/5.12.5/Test/Builder.pm line 229.

This output seems to be pretty clear proof that you're not using
Apple's Perl.  What is it exactly (where did you get it from)?

Also, noticing that what you're using is evidently Perl 5.12, I'm
wondering whether our TAP test scripts require a fairly new Perl version.
I recall some of my Salesforce colleagues griping that the TAP scripts
didn't work with older Perls.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements

2014-10-02 Thread Fabrízio de Royes Mello
On Thu, Oct 2, 2014 at 9:38 PM, Marti Raudsepp ma...@juffo.org wrote:

 On Tue, Aug 26, 2014 at 4:20 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  On 04/14/2014 10:31 PM, Fabrízio de Royes Mello wrote:
  The attached patch contains CINE for sequences.
 
  I just strip this code from the patch rejected before.
 
  Committed with minor changes

 Hmm, the CommitFest app lists Marko Tiikkaja as the reviewer, but I
 can't find his review anywhere...


Maybe he had no time to review it.


 The documentation claims:
 CREATE [ IF NOT EXISTS ] SEQUENCE name
 But grammar implements it the other way around:
 CREATE SEQUENCE IF NOT EXISTS name;


You are correct. Fix attached.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello
diff --git a/doc/src/sgml/ref/create_sequence.sgml b/doc/src/sgml/ref/create_sequence.sgml
index 7292c3f..9e364ff 100644
--- a/doc/src/sgml/ref/create_sequence.sgml
+++ b/doc/src/sgml/ref/create_sequence.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
 <refsynopsisdiv>
<synopsis>
-CREATE [ TEMPORARY | TEMP ] [ IF NOT EXISTS ] SEQUENCE <replaceable class="parameter">name</replaceable> [ INCREMENT [ BY ] <replaceable class="parameter">increment</replaceable> ]
+CREATE [ TEMPORARY | TEMP ] SEQUENCE [ IF NOT EXISTS ] <replaceable class="parameter">name</replaceable> [ INCREMENT [ BY ] <replaceable class="parameter">increment</replaceable> ]
 [ MINVALUE <replaceable class="parameter">minvalue</replaceable> | NO MINVALUE ] [ MAXVALUE <replaceable class="parameter">maxvalue</replaceable> | NO MAXVALUE ]
 [ START [ WITH ] <replaceable class="parameter">start</replaceable> ] [ CACHE <replaceable class="parameter">cache</replaceable> ] [ [ NO ] CYCLE ]
 [ OWNED BY { <replaceable class="parameter">table_name</replaceable>.<replaceable class="parameter">column_name</replaceable> | NONE } ]

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DDL Damage Assessment

2014-10-02 Thread Jim Nasby

On 10/2/14, 2:43 PM, Josh Berkus wrote:

Questions:

  1. Do you agree that a systematic way to report what a DDL command (or
 script, or transaction) is going to do on your production database
 is a feature we should provide to our growing user base?

Yes.

+1

  2. What do you think such a feature should look like?

As with others, I think EXPLAIN is a good way to do this without adding
a keyword.  So you'd do:

EXPLAIN
ALTER TABLE 

I'm thinking it would be better to have something you could set at a session 
level, so you don't have to stick EXPLAIN in front of all your DDL.
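
Something along these lines, say (the GUC name below is purely hypothetical,
just to illustrate the idea):

  SET ddl_report_impact = on;   -- hypothetical setting, not implemented
  ALTER TABLE foo ALTER COLUMN bar TYPE bigint;   -- would emit the impact report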

As for the dry-run idea, I don't think that's really necessary. I've never seen anyone 
serious who doesn't have a development environment, which is where you would simply 
deploy the real DDL using verbose mode and see what the underlying commands 
actually do.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

