Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-05-02 Thread Fabien COELHO


Hello,


The counters are updated when the transaction is finished anyway?


Yes, but the thread does not know it's time to write the results until
it completes the first transaction after the interval ends ...

Let's say the very first query in thread #1 takes a minute for some
reason, while the other threads process 100 transactions per second. So
before the thread #1 can report 0 transactions for the first second, the
other threads already have results for the 60 intervals.

I think there's no way to make this work except for somehow tracking
timestamp of the last submitted results for each thread, and only
flushing results older than the minimum of the timestamps. But that's
not trivial - it certainly is more complicated than just writing into a
shared file descriptor.


I agree that such an approach would be horrible for very limited value.
However, I was suggesting that a transaction is counted only when it is
finished, so the aggregated data is to be understood as referring to
transactions finished in the interval, and whatever is in progress would be
counted in the next interval anyway.



Merging results for each transaction would not have this issue, but it
would also use the lock much more frequently, and that seems like a
pretty bad idea (especially for the workloads with short transactions
that you suggested are a bad match for the detailed log - this would make
the aggregated log bad too).

Also notice that all the threads will try to merge the data (and thus
acquire the lock) at almost the same time - this is especially true for
very short transactions. I would be surprised if this did not cause
issues on read-only workloads with large numbers of threads.


ISTM that the aggregated version should fare better than the detailed log,
whatever is done: the key performance issue is that fprintf is slow; with the
aggregated log the fprintf calls are infrequent, and only arithmetic remains
in the critical section.
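As a minimal sketch of that critical section (illustrative names, not the
actual pgbench structures), a finished transaction would only do a few
additions under the mutex, with the infrequent fprintf happening elsewhere:

    #include <pthread.h>

    typedef struct {
        long   count;
        double sum;     /* sum of latencies */
        double sum2;    /* sum of squared latencies, for stddev */
    } SharedAgg;

    static SharedAgg agg;
    static pthread_mutex_t agg_lock = PTHREAD_MUTEX_INITIALIZER;

    static void account_transaction(double latency)
    {
        pthread_mutex_lock(&agg_lock);  /* held only for a few additions */
        agg.count += 1;
        agg.sum   += latency;
        agg.sum2  += latency * latency;
        pthread_mutex_unlock(&agg_lock);
    }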



(2) The feature would not be available for the thread-emulation with
this approach, but I do not see this as a particular issue as I
think that it is pretty much only dead code and a maintenance burden.


I'm not willing to investigate that, nor am I willing to implement
another feature that works only sometimes (I've done that in the past,
and I find it a bad practice).


[...]


After the small discussion I triggered, I've submitted a patch to drop 
thread fork-emulation from pgbench.



[...]
Also, if the lock for the shared buffer is cheaper than the lock
required for fprintf, it may still be an improvement.


Yep. fprintf does a lot of processing, so it is the main issue.

--
Fabien.




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-05-02 Thread Tomas Vondra

Hi,

On 05/02/15 15:30, Fabien COELHO wrote:


Hello,


The counters are updated when the transaction is finished anyway?


Yes, but the thread does not know it's time to write the results until
it completes the first transaction after the interval ends ...

Let's say the very first query in thread #1 takes a minute for some
reason, while the other threads process 100 transactions per second. So
before the thread #1 can report 0 transactions for the first second, the
other threads already have results for the 60 intervals.

I think there's no way to make this work except for somehow tracking
timestamp of the last submitted results for each thread, and only
flushing results older than the minimum of the timestamps. But that's
not trivial - it certainly is more complicated than just writing into a
shared file descriptor.


I agree that such an approach would be horrible for very limited value.
However, I was suggesting that a transaction is counted only when it is
finished, so the aggregated data is to be understood as referring to
transactions finished in the interval, and whatever is in progress
would be counted in the next interval anyway.


That only works if every single transaction is immediately written into 
the shared buffer/file, but that would require acquiring a lock shared 
by all the threads. And that's not really free - for cases with many 
clients doing tiny transactions, this might be a big issue, for example.


That's why I suggested that each client uses a shared buffer for the 
results, and only submits the results once the interval is over. 
Submitting the results, however, happens on the first transaction from the 
next interval - if that transaction takes long, the results are not 
submitted until it completes.


It might be done in the other direction, though - the writer thread 
might collect current results at the end of the interval.



Merging results for each transaction would not have this issue, but
it would also use the lock much more frequently, and that seems
like a pretty bad idea (especially for the workloads with short
transactions that you suggested are a bad match for the detailed log -
this would make the aggregated log bad too).

Also notice that all the threads will try to merge the data
(and thus acquire the lock) at almost the same time - this is
especially true for very short transactions. I would be surprised
if this did not cause issues on read-only workloads with large
numbers of threads.


ISTM that the aggregated version should fare better than the
detailed log, whatever is done: the key performance issue is that
fprintf is slow; with the aggregated log the fprintf calls are
infrequent, and only arithmetic remains in the critical section.


Probably.


(2) The feature would not be available for the thread-emulation with
this approach, but I do not see this as a particular issue as I
think that it is pretty much only dead code and a maintenance burden.


I'm not willing to investigate that, nor am I willing to implement
another feature that works only sometimes (I've done that in the past,
and I find it a bad practice).


[...]


After the small discussion I triggered, I've submitted a patch to drop
thread fork-emulation from pgbench.


OK, good.




[...]
Also, if the lock for the shared buffer is cheaper than the lock
required for fprintf, it may still be an improvement.


Yep. fprintf does a lot of processing, so it is the main issue.


The question is whether the processing happens while holding the lock, 
though. I don't think that's the case.



--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-05-02 Thread Fabien COELHO


It might be done in the other direction, though - the writer thread 
might collect current results at the end of the interval.


Yep, you can indeed accumulate per thread and sum at the end of the 
interval, but a lock is still needed if you want exact figures.


ISTM that this is what is done for --progress, without even bothering to 
lock anything (even if transactions are fast and some inconsistencies arise, 
this is just for showing how things are faring on stderr, so no big deal).



fprintf does a lot of processing, so it is the main issue.


The question is whether the processing happens while holding the lock, 
though. I don't think that's the case.


Alas, I'm nearly sure it is necessarily the case: there is no 
intermediate buffer in fprintf, the buffering is done through the FILE 
handle, so a lock must be held throughout format processing.


https://www.gnu.org/software/libc/manual/html_node/Streams-and-Threads.html#Streams-and-Threads

The POSIX standard requires that by default the stream operations are 
atomic.  I.e., issuing two stream operations for the same stream in two 
threads at the same time will cause the operations to be executed as if 
they were issued sequentially. The buffer operations performed while 
reading or writing are protected from other uses of the same stream. To do 
this each stream has an internal lock object which has to be (implicitly) 
acquired before any work can be done. 
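That per-stream lock can also be taken explicitly with POSIX flockfile(),
which is one way to keep a multi-call log record atomic; a small sketch
(the line format is made up for illustration):

    #include <stdio.h>

    static void log_line_locked(FILE *log, long tx_id, double latency_us)
    {
        flockfile(log);                    /* take the stream lock once */
        fprintf(log, "%ld %.3f", tx_id, latency_us);
        putc_unlocked('\n', log);          /* safe: we hold the lock */
        funlockfile(log);
    }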


--
Fabien.




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-29 Thread Fabien COELHO


Hello Tomas,


The results are as follows:

 * 1 thread 33 runs median tps (average is consistent):
 - no logging:          22062
 - separate logging:    19360  (-12.2%)
 - merged logging:      19326  (-12.4%, not significant from previous)


Interesting. What hardware is this?


Dell PowerEdge R720 with 2 Intel Xeon CPU E5-2660 2.20GHz (8 cores and 16 
threads per processor, so 32 threads in total), running with Linux 3.13 
(Ubuntu trusty).



I wouldn't be surprised by this behavior on a multi-socket system, [...]


There are 2 sockets.


So my overall conclusion is:

(1) The simple thread-shared file approach would save pgbench from
post-processing merge-sort heavy code, for a reasonable cost.


No it wouldn't - you're missing the fact that the proposed approach
(shared file + fprintf) only works with the raw transaction log.

It does not work with the aggregated log - the threads would have to
somehow track the progress of the other threads, in a very non-trivial
way (e.g. what if one of the threads executes a long query, and thus
does not send the results for a long time?).


The counters are updated when the transaction is finished anyway?

Another option would be to update shared aggregated results, but that 
requires locking.


That is what I had in mind. ISTM that the locking impact would be much 
lower than for logging: the data is locked just for a counter update, and 
if counters are per-thread, a conflict may only occur when the data are 
gathered for actual logging, which would be rather infrequent. Even if the 
counters are shared, the locking would be for such a small time that the 
conflict probability stays low. So I do not see this one as a significant 
performance issue.



(2) The feature would not be available for the thread-emulation with
this approach, but I do not see this as a particular issue as I
think that it is pretty much only dead code and a maintenance burden.


I'm not willing to investigate that, nor am I willing to implement
another feature that works only sometimes (I've done that in the past,
and I find it a bad practice).


Hmmm. Keeping an obsolete feature with significant impact on how other 
features can be implemented, so basically a maintenance burden, does not 
look like best practice *either*.



If someone else is willing to try to eliminate the thread emulation, I
won't object to that.


Hmmm. I'll try to trigger a discussion in another thread to test the idea.

But as I pointed out above, simple fprintf is not going to work for the 
aggregated log - solving that will need more code (e.g. maintaining 
aggregated results for all threads, requiring additional locking etc).


The code for that is probably simple and short, and my wish is to avoid an 
external merge-sort post-processing step if possible, which is not 
especially cheap anyway, neither in code nor in time.



(3) Optimizing doLog from its current fprintf-based implementation
may be a good thing.


That's probably true. The simplest thing we can do right now is
buffering the data into larger chunks and writing those chunks.
That amortizes the costs of locking.


If it is buffered in the process, that would mean more locking. If it is 
buffered per thread, that would result in out-of-order logs. Would that be 
an issue? It would be fine with me.



Using O_APPEND, as suggested by Andres, seems like a promising idea.


I tried that with a shared file handle, but the impact seemed negligible. 
The figures I reported used it, btw.


I also tried to open the same file in append mode from all threads, with 
positive performance effects, but then the flush did not occur at line 
boundaries, so there was some mangling in the result.


--
Fabien.




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-29 Thread Fabien COELHO



I also implemented a quick and dirty version for a merge log based on
sharing a file handle (append mode + sprintf + fputs).


I tried the append + per-thread 2KB buffered sprintf + fputs-when-full
variant, with the same number of runs. The logs are out of order by
chunks; the overhead seems higher with 1 thread, but there is no extra
overhead with 12 threads.
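For reference, the buffered variant might look roughly like this sketch
(illustrative names and line format, not the actual test code): each thread
formats into a private buffer and only calls fputs on the shared handle when
the buffer is nearly full, so the stream lock is taken once per ~2KB chunk
rather than once per transaction.

    #include <stdio.h>
    #include <string.h>

    #define LOG_BUF_SIZE 2048

    typedef struct {
        char   buf[LOG_BUF_SIZE];
        size_t used;
    } ThreadLogBuf;

    static void buffered_log(ThreadLogBuf *tb, FILE *shared_log,
                             long tx_id, double latency_us)
    {
        char line[128];
        int  len = snprintf(line, sizeof(line), "%ld %.3f\n",
                            tx_id, latency_us);

        if (len <= 0)
            return;
        if (tb->used + (size_t) len >= sizeof(tb->buf)) {
            fputs(tb->buf, shared_log);  /* one locked call per chunk */
            tb->used = 0;
        }
        memcpy(tb->buf + tb->used, line, (size_t) len);
        tb->used += (size_t) len;
        tb->buf[tb->used] = '\0';        /* keep the buffer fputs-able */
    }

    /* Each thread must also flush its remaining buffer once at the end
     * of the run; flushing whole chunks is why lines appear out of
     * order by chunks. */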


The results are as follows:

 * 1 thread 33 runs median tps (average is consistent):
 - no logging:          22062
 - separate logging:    19360  (-12.2%)
 - merged logging:      19326  (-12.4%, not significant from previous)
 - buf merged logging:  18751  (-15%, seems significant)


The worst overhead I could trigger is with 12 threads:

 * 12 threads 35 runs median tps (average is consistent)
 - no logging:          155917
 - separate logging:    124695  (-20.0%)
 - merged logging:      119349  (-23.5%)
 - buf merged logging:  124914  (-19.9%, not significant from separate)

--
Fabien.




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-28 Thread Tomas Vondra
On 28.3.2015 11:21, Fabien COELHO wrote:
 
 Hello Tomas,
 
 I do agree that fprintf is not cheap, actually when profiling pgbench
 it's often the #1 item, but the impact on the measurements is actually
 quite small. For example with a small database (scale 10) and read-only
 30-second runs (single client), I get this:

   no logging: 18672 18792 18667 18518 18613 18547
 with logging: 18170 18093 18162 18273 18307 18234

 So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
 And with more expensive transactions (larger scale, writes, ...) the
 difference will be much smaller.
 
 I did some testing with scale 10, prepared SELECT-only, 200-second runs
 (plenty of them), with local socket connections on the largest host I have
 available:
 
   pgbench -P 10 -T 200 -S -M prepared -j $c -c $c ...
 
 I think that this cpu-bound bench is kind of a worst possible case for
 the detailed per transaction log.
 
 I also implemented a quick and dirty version for a merge log based on
 sharing a file handle (append mode + sprintf + fputs).
 
 The results are as follows:
 
  * 1 thread 33 runs median tps (average is consistent):
  - no logging:          22062
  - separate logging:    19360  (-12.2%)
  - merged logging:      19326  (-12.4%, not significant from previous)
 

Interesting. What hardware is this? I wouldn't be surprised by this
behavior on a multi-socket system, where the cross-cpu transfers are
likely more expensive than on the single-socket system I used (sorry,
forgot to mention what hardware I used).

 Note that the impact of logging is much larger than with your tests.
 The underlying fprintf comes from gnu libc 2.19.

I'm running glibc 2.19 too, on kernel 3.17 (might be important too).

 The worst overhead I could trigger is with 12 threads:
 
  * 12 threads 35 runs median tps (average is consistent)
  - no logging:   155917
  - separate logging: 124695 (-20.0%)
  - merged logging:   119349 (-23.5%)
 
 My conclusion from these figures is that although the direct merged
 logging approach adds some overhead, this overhead is small wrt
 detailed logging (it adds 3.5% to a 20% logging overhead) with 12
 threads. Other tests, even with more threads, did not yield larger
 absolute or relative overheads. Although the direct merge approach is
 shown to add overhead, this is a small addition to an already quite
 bad situation, which suggests that using the detailed log on a
 cpu-bound pgbench run is a bad idea to begin with.

Yeah, seems like that.

 For a more realistic test, I ran simple updates which involve actual
 disk writes. It ran at around 840 tps with 24 threads. The logging
 overhead seems to be under 1%, and there is no significant difference
 between separate and merged on the 20 runs.
 
 So my overall conclusion is:
 
 (1) The simple thread-shared file approach would save pgbench from 
 post-processing merge-sort heavy code, for a reasonable cost.

No it wouldn't - you're missing the fact that the proposed approach
(shared file + fprintf) only works with the raw transaction log.

It does not work with the aggregated log - the threads would have to
somehow track the progress of the other threads, in a very non-trivial
way (e.g. what if one of the threads executes a long query, and thus
does not send the results for a long time?). Another option would be to
update shared aggregated results, but that requires locking.

 (2) The feature would not be available for the thread-emulation with 
 this approach, but I do not see this as a particular issue as I
 think that it is pretty much only dead code and a maintenance burden.

I'm not willing to investigate that, nor am I willing to implement
another feature that works only sometimes (I've done that in the past,
and I find it a bad practice).

If someone else is willing to try to eliminate the thread emulation, I
won't object to that. But as I pointed out above, simple fprintf is not
going to work for the aggregated log - solving that will need more code
(e.g. maintaining aggregated results for all threads, requiring
additional locking etc).

 
 (3) Optimizing doLog from its current fprintf-based implementation
 may be a good thing.

That's probably true. The simplest thing we can do right now is
buffering the data into larger chunks and writing those chunks. That
amortizes the costs of locking. Using O_APPEND, as suggested by Andres,
seems like a promising idea.

-- 
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-28 Thread Fabien COELHO


Hello Tomas,


I do agree that fprintf is not cheap, actually when profiling pgbench
it's often the #1 item, but the impact on the measurements is actually
quite small. For example with a small database (scale 10) and read-only
30-second runs (single client), I get this:

  no logging: 18672 18792 18667 18518 18613 18547
with logging: 18170 18093 18162 18273 18307 18234

So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
And with more expensive transactions (larger scale, writes, ...) the
difference will be much smaller.


I did some testing with scale 10, prepared SELECT-only, 200-second runs 
(plenty of them), with local socket connections on the largest host I have 
available:


  pgbench -P 10 -T 200 -S -M prepared -j $c -c $c ...

I think that this cpu-bound bench is kind of a worst possible case for the 
detailed per transaction log.


I also implemented a quick and dirty version for a merge log based on 
sharing a file handle (append mode + sprintf + fputs).


The results are as follows:

 * 1 thread 33 runs median tps (average is consistent):
 - no logging:          22062
 - separate logging:    19360  (-12.2%)
 - merged logging:      19326  (-12.4%, not significant from previous)

Note that the impact of logging is much larger than with your tests.
The underlying fprintf comes from gnu libc 2.19.

The worst overhead I could trigger is with 12 threads:

 * 12 threads 35 runs median tps (average is consistent)
 - no logging:          155917
 - separate logging:    124695  (-20.0%)
 - merged logging:      119349  (-23.5%)

My conclusion from these figures is that although the direct merged 
logging approach adds some overhead, this overhead is small wrt detailed 
logging (it adds 3.5% to a 20% logging overhead) with 12 threads. Other 
tests, even with more threads, did not yield larger absolute or relative 
overheads. Although the direct merge approach is shown to add overhead, 
this is a small addition to an already quite bad situation, which suggests 
that using the detailed log on a cpu-bound pgbench run is a bad idea to 
begin with.


For a more realistic test, I ran simple updates which involve actual 
disk writes. It ran at around 840 tps with 24 threads. The logging 
overhead seems to be under 1%, and there is no significant difference 
between separate and merged on the 20 runs.


So my overall conclusion is:

(1) The simple thread-shared file approach would save pgbench from 
post-processing merge-sort heavy code, for a reasonable cost.


(2) The feature would not be available for the thread-emulation with this 
approach, but I do not see this as a particular issue as I think that it 
is pretty much only dead code and a maintenance burden.


(3) Optimizing doLog from its current fprintf-based implementation may be 
a good thing.


--
Fabien.




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-23 Thread didier
Hi,

On Sat, Mar 21, 2015 at 8:42 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:

 Hello Didier,

 If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the
 collision probability would be about 1/40 and the delayed thread would be
 waiting for half this time on average, so the performance impact due to
 fprintf locking would be negligible (a 1/80 delay occurring in 1/40 of
 cases = 1/3200 time added to the computed average, if I'm not mistaken).
Yes, but for a third thread (each on a physical core) it will be 1/40 +
1/40, and so on up to roughly 40/40 for 40 cores.



 If threads run more or less the same code with the same timing, after
 a while they will lockstep on synchronization primitives and your
 collision probability will be very close to 1.


 I'm not sure I understand. If transaction times were really constant, then
 after a while the mutexes would be synchronised so as to avoid contention,
 i.e. the collision probability would be 0?
But they aren't constant, only close. It may or may not show up in this
case, but I've noticed that the collision rate is often a lot higher
than the probability would suggest; I'm not sure why.


 Moreover, they will write to the same cache lines for every fprintf,
 and this is very, very bad even without atomic operations.


 We're talking of transactions that involve network messages and possibly
 disk IOs on the server, so some cache issues within pgbench would not
 be a priori the main performance driver.
Sure, but:
- good measurement is hard, and adding locking in fprintf makes its
timing noisier.

- it's against 'good practices' for scalable code. Trivial code can
show that the elapsed time for as few as four cores writing to the same
cache line in a loop, without locking or synchronization, is greater
than the elapsed time for running these four loops sequentially on one
core. If they write to different cache lines it scales linearly.
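A hypothetical version of that trivial code (time each variant separately,
e.g. under /usr/bin/time; the packed version is typically several times
slower because all threads dirty the same cache line):

    #include <pthread.h>

    #define NCORES 4
    #define ITERS  (100 * 1000 * 1000L)

    static volatile long packed[NCORES];     /* all in one cache line */
    static struct { volatile long v; char pad[64 - sizeof(long)]; }
        padded[NCORES];                      /* one cache line each */

    static void *bump_packed(void *arg)
    {
        long i = (long) arg;
        for (long n = 0; n < ITERS; n++)
            packed[i]++;                     /* line ping-pongs between cores */
        return NULL;
    }

    static void *bump_padded(void *arg)
    {
        long i = (long) arg;
        for (long n = 0; n < ITERS; n++)
            padded[i].v++;                   /* each core owns its line */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NCORES];

        for (long i = 0; i < NCORES; i++)    /* swap in bump_padded to compare */
            pthread_create(&t[i], NULL, bump_packed, (void *) i);
        for (int i = 0; i < NCORES; i++)
            pthread_join(t[i], NULL);
        return 0;
    }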

Regards
Didier




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-23 Thread Andres Freund
On March 23, 2015 8:00:04 PM GMT+01:00, Fabien COELHO coe...@cri.ensmp.fr 
wrote:

 Guys, I don't see this theoretical discussion going anywhere. I think
 it's time to simply implement this and evaluate it on a bigger
 machine.

Sure. I was kind of hoping that someone else would implement it, because
I'm a reviewer on this one, and I do not have the bigger machine at hand
either :-) If nothing happens and when I have some time, I'll do it.

It wasn't specifically aimed at you...

 It can't take very long to implement: simply write to one
 file instead of the multiple files as now.

Sure.

 The posix-guaranteed fprintf locking should already take care of the 
 rest.

I think I would try with snprintf to reduce locking.

Glibc's locking is around the stream buffer IIRC. So that's not likely to do 
much...


-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-23 Thread Andres Freund
Hi,

Guys, I don't see this theoretical discussion going anywhere. I think
it's time to simply implement this and evaluate it on a bigger
machine. It can't take very long to implement: simply write to one
file instead of the multiple files as now. The posix-guaranteed fprintf
locking should already take care of the rest.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-23 Thread Fabien COELHO


Hello,


Yes, but for a third thread (each on a physical core) it will be 1/40 +
1/40, and so on up to roughly 40/40 for 40 cores.


That is why I proposed a formula which depends on the number of threads.

[...] But they aren't constant, only close. It may or may not show up in this 
case, but I've noticed that the collision rate is often a lot higher than 
the probability would suggest; I'm not sure why,


If so, I would suggest that the probability is wrong and try to 
understand why :-)



Moreover, they will write to the same cache lines for every fprintf,
and this is very, very bad even without atomic operations.


We're talking of transactions that involve network messages and possibly
disk IOs on the server, so some cache issues within pgbench would not
be a priori the main performance driver.



Sure, but:
- good measurement is hard, and adding locking in fprintf makes its
timing noisier.


This really depends on the probability of the lock collisions. If it is 
small enough, the impact would be negligible.



- it's against 'good practices' for scalable code.
Trivial code can show that the elapsed time for as few as four cores 
writing to the same cache line in a loop, without locking or 
synchronization, is greater than the elapsed time for running these four 
loops sequentially on one core. If they write to different cache lines it 
scales linearly.


I'm not arguing about general scalability principles, which may or may 
not be relevant to the case at hand.

I'm discussing whether the proposed feature can be implemented much more 
simply with a mutex instead of the current proposal, which is on the heavy 
side and thus induces more maintenance effort later.


Now I agree that if there is a mutex it must be as short as possible and 
must not hinder performance significantly for pertinent use cases. Note 
that the overhead evaluation by Tomas is pessimistic, as it only involves 
read-only transactions for which all transaction details are logged. Note 
also that if you have 1000 cores to run pgbench and locking may be an 
issue, you could still use the per-thread logs.


The current discussion suggests that each thread should prepare the string 
off-lock (say with some sprintf) and then only lock when sending the 
string. This looks reasonable, but still needs to be validated (i.e. that 
the lock time would indeed be very small wrt the transaction time).


--
Fabien.




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-23 Thread Fabien COELHO



Guys, I don't see this theoretical discussion going anywhere. I think
it's time to simply implement this and evaluate it on a bigger
machine.


Sure. I was kind of hoping that someone else would implement it, because 
I'm a reviewer on this one, and I do not have the bigger machine at hand 
either :-) If nothing happens and when I have some time, I'll do it.



It can't take very long to implement: simply write to one
file instead of the multiple files as now.


Sure.

The posix-guaranteed fprintf locking should already take care of the 
rest.


I think I would try with snprintf to reduce locking.

Note that I'm first trying to evaluate the logging overhead, but I'm quite 
far from Tomas' figures at the moment; it looks more like 20% overhead (12 
threads, read-only load, runs at about 13 tps with full logging, 16 
without, repeated 5 times with similar results).


In such a case, a mutex would be a bad idea, but I must add that detailed 
logging is a plain bad idea as well!


--
Fabien.




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-21 Thread Andres Freund
On 2015-03-17 11:50:28 -0400, Robert Haas wrote:
 On Tue, Mar 17, 2015 at 11:27 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
  The fprintf we are talking about occurs at most once per pgbench
  transaction, possibly much less when aggregation is activated, and this
  transaction involves networks exchanges and possibly disk writes on the
  server.
 
 random() was occurring four times per transaction rather than once,
 but OTOH I think fprintf() is probably a much heavier-weight
 operation.  The way to know if there's a real problem here is to test
 it, but I'd be pretty surprised if there isn't.

Well, fprintf() doesn't have to acquire the lock for the entirety of
its operation - just for the access to the stream buffer.

Note that posix 2001 *does* guarantee that FILE* style IO is thread
safe:
All functions that reference (FILE *) objects, except those with names
ending in _unlocked, shall behave as if they use flockfile() and
funlockfile() internally to obtain ownership of these (FILE *) objects.

Hilariously, that tidbit is hidden in the documentation about
flockfile. Very, err, easy to find:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/flockfile.html

But I agree that we simply need to test this on a larger machine.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-21 Thread Fabien COELHO


Hello Tomas,


My point is that if there are many threads and tremendous TPS, the
*detailed* per-transaction log (aka simple log) is probably a bad
choice anyway, and the aggregated version is the way to go.


I disagree with this reasoning. Can you provide numbers supporting it?


I'm not sure which part of this reasoning you want me to support with 
numbers.


My first argument is qualitative: I'm trying to say that if you are doing 
10 tps, the per-transaction detailed measures would probably be useless 
as such, so there is no point in logging them all. You would rather be 
interested in aggregate figures, and should also be interested in extremal 
events (very slow queries) and their frequency.


The second point is that aggregated logs should be cheaper than the 
detailed log; it seems more or less self-evident to me.



I do agree that fprintf is not cheap, actually when profiling pgbench
it's often the #1 item, but the impact on the measurements is actually
quite small. For example with a small database (scale 10) and read-only
30-second runs (single client), I get this:

  no logging: 18672 18792 18667 18518 18613 18547
with logging: 18170 18093 18162 18273 18307 18234

So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
And with more expensive transactions (larger scale, writes, ...) the
difference will be much smaller.


Ok. Great!

Let us take this as a worst-case figure and try some maths.

If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the 
collision probability would be about 1/40 and the delayed thread would be 
waiting for half this time on average, so the performance impact due to 
fprintf locking would be negligible (a 1/80 delay occurring in 1/40 of 
cases = 1/3200 time added to the computed average, if I'm not mistaken).


With t threads, I would guess that the collision probability to first 
order is about 0.5*t*(t-1)*p, and the performance impact would be 
~ 0.5*t*(t-1)*p * 0.5*p (hmmm... this approximation, if not completely 
stupid, only holds for small p and not too large t).


With your worst-case figure and some rounding, it seems to look like:

  #threads   collision probability   performance impact
  2          1/40                    1/3200
  4          1/7                     1/533
  8          0.7                     0.01 (about 1%)

This suggests that for a pessimistic (read-only load) fprintf overhead 
ratio there would be a small impact even with 8 threads doing 2 tps each.
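Written out (the same first-order approximation, valid only for small p and
not too large t, and ignoring the lock's own cost):

    \[
      P_{\mathrm{collision}} \approx \tfrac{1}{2}\, t (t-1)\, p,
      \qquad
      \mathrm{impact} \approx P_{\mathrm{collision}} \cdot \tfrac{p}{2}
                      = \tfrac{1}{4}\, t (t-1)\, p^2
    \]

With p = 1/40 this reproduces the table: t = 2 gives 1/40 and 1/3200,
t = 4 gives about 1/7 and 1/533, t = 8 gives 0.7 and about 1%.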



It's true that this might produce large logs, especially when the runs
are long, but that has nothing to do with fprintf. And it can be easily
fixed by either using a dedicated client machine, or only sampling the
transaction log.


Sure. In which case locking/mutex is not an issue.


Introducing actual synchronization between the threads (by locking
inside fprintf) is however a completely different thing.


It is the issue I'm trying to discuss.


Note that even without mutex fprintf may be considered a heavy
function which is going to slow down the transaction rate
significantly. That could be tested as well.

It is possible to reduce the lock time by preparing the string
(which would mean introducing buffers) and just doing an fputs under
mutex. That would not reduce the print time anyway, and it may add
malloc/free operations, though.


I seriously doubt fprintf does the string formatting while holding the lock
on the file.


I do not think that the lock is necessarily at the file (OS level), but at 
least it has to hold the FILE structure to manage buffering without 
interference from other threads during string processing and printing.


So by doing this you only simulate what fprintf() does (assuming it's 
thread-safe on your platform) and gain nothing.


Ok, if fprintf is thread safe, I fully agree that there is no need to add 
a mutex or lock! However I do not know how to test that, so putting a 
useless mutex would prevent



The way to know if there's a real problem here is to test it, but
I'd be pretty surprised if there isn't.


Indeed, I think I can contrive a simple example where it is,
basically a more or less empty or read only transaction (eg SELECT
1).


That would be nice, because my quick testing suggests it's not the case.


I do not understand your point. ISTM that your test and my maths suggest
that the performance impact of the fprintf approach is reasonably low.


My opinion is that there is a tradeoff between code simplicity and
later maintenance vs feature benefit.

If threads are assumed and fprintf is used, the feature is much
simpler to implement, and the maintenance is lighter.


I think the "if threads are assumed" part makes this dead in the water
unless someone wants to spend time on getting rid of the thread
emulation.


My suggestion is to skip the feature under thread emulation, not 
necessarily to remove thread emulation, because of the argument you raise 
below.


Removing the code is quite simple, researching whether we can do that 
will be difficult IMHO [...]

Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-21 Thread Andres Freund
On 2015-03-21 10:37:05 +0100, Fabien COELHO wrote:
 
 Hello Tomas,
 Let us take this as a worst-case figure and try some maths.
 
 If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the
 collision probability would be about 1/40 and the delayed thread would be
 waiting for half this time on average, so the performance impact due to
 fprintf locking would be negligible (a 1/80 delay occurring in 1/40 of
 cases = 1/3200 time added to the computed average, if I'm not mistaken).
 
 With t threads, I would guess that the collision probability to first
 order is about 0.5*t*(t-1)*p, and the performance impact would be
 ~ 0.5*t*(t-1)*p * 0.5*p (hmmm... this approximation, if not completely
 stupid, only holds for small p and not too large t).
 
 With your worst-case figure and some rounding, it seems to look like:
 
   #threads   collision probability   performance impact
   2          1/40                    1/3200
   4          1/7                     1/533
   8          0.7                     0.01 (about 1%)
 
 This suggests that for a pessimistic (read-only load) fprintf overhead
 ratio there would be a small impact even with 8 threads doing 2 tps each.

I think math like this mostly disregards hardware realities. You don't
actually need to have actual lock contention to notice overhead -
frequently acquiring an *uncontended* lock that resides in another
socket's cache and where the cacheline is dirty requires relatively
expensive cross-cpu transfers. That's all besides the overhead of doing
a lock operation itself. A lock xaddl, or whatever you end up using,
has a significant cost in itself. It implies a bus lock and cache flush,
which is far from free.

Additionally we're quite possibly talking about more than 8
threads. I've frequently used pgbench with hundreds of threads; for imo
good reasons.

That all said, it's far from guaranteed that there's an actual problem
here. If done right, i.e. the expensive formatting of the string is
separated from the locked output to the kernel, it might end up being
acceptable.

I wonder how bad using unbuffered write + O_APPEND would end up being;
without a lock.
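A sketch of that idea (hedged: POSIX only guarantees the atomic offset
update for O_APPEND, and whether concurrent appends can still tear a line
apart is precisely the open question; the line format is made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int open_log(const char *path)
    {
        return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    }

    static void append_log(int fd, long tx_id, double latency_us)
    {
        char line[128];
        int  len = snprintf(line, sizeof(line), "%ld %.3f\n",
                            tx_id, latency_us);

        if (len > 0)
            (void) write(fd, line, len); /* one syscall, no userspace lock */
    }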

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-21 Thread didier
Hi,

On Sat, Mar 21, 2015 at 10:37 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:

   no logging: 18672 18792 18667 18518 18613 18547
 with logging: 18170 18093 18162 18273 18307 18234

 So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
 And with more expensive transactions (larger scale, writes, ...) the
 difference will be much smaller.


 Ok. Great!

 Let us take this as a worst-case figure and try some maths.

  If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the
  collision probability would be about 1/40 and the delayed thread would be
  waiting for half this time on average, so the performance impact due to
  fprintf locking would be negligible (a 1/80 delay occurring in 1/40 of
  cases = 1/3200 time added to the computed average, if I'm not mistaken).
If threads run more or less the same code with the same timing, after
a while they will lockstep on synchronization primitives and your
collision probability will be very close to 1.

Moreover, they will write to the same cache lines for every fprintf,
and this is very, very bad even without atomic operations.

Regards
Didier




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-21 Thread Fabien COELHO


Hello Andres,


With your worst-case figure and some rounding, it seems to look like:

  #threads   collision probability   performance impact
  2          1/40                    1/3200
  4          1/7                     1/533
  8          0.7                     0.01 (about 1%)

This suggests that for a pessimistic (read-only load) fprintf overhead
ratio there would be a small impact even with 8 threads doing 2 tps each.


I think math like this mostly disregards hardware realities.


Hmmm. In my mind, doing the maths helps understand what may be going on.

Note that it does not preclude checking afterwards that it does indeed 
correspond to reality :-)


The key suggestion of the maths is that if p*t << 1 all is (seems) well.

You don't actually need to have actual lock contention to notice 
overhead.


The overhead assumed is 1/40 of the transaction time from Tomas' measures. 
Given the ~ 18000 tps (we are talking of an in-memory read-only load 
probably on the same host), transaction time for pgbench seems to be about 
0.06 ms, and fprintf seems to be about 0.0015 ms (1.5 µs).


- frequently acquiring an *uncontended* lock that resides in another 
socket's cache and where the cacheline is dirty requires relatively 
expensive cross-cpu transfers. That's all besides the overhead of doing 
a lock operation itself. A lock xaddl, or whatever you end up using, 
has a significant cost in itself. It implies a bus lock and cache flush, 
which is far from free.


Ok, I did not assume an additional lock cost. Do you have a figure? A 
quick googling suggested figures for lightweight mutexes around 100 ns, 
but the test conditions were unclear. If that is right, then adding that 
overhead does not change the above maths much.


Additionally we're quite possibly talking about more than 8 threads. 
I've frequently used pgbench with hundreds of threads; for imo good 
reasons.


Good for you. I do not have access to a host on which this would make 
sense :-)



That all said, it's far from guaranteed that there's an actual problem
here. If done right, i.e. the expensive formatting of the string is
separated from the locked output to the kernel, it might end up being
acceptable.


That is what I would like to assess. Indeed, snprintf (to avoid mallocing 
anything) followed by fputs/write/whatever would probably help reduce the 
contention probability, if not the actual overhead.
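Something like this sketch, say (illustrative line format; the only
serialized part is the fputs, which takes the stream's internal lock):

    #include <stdio.h>

    static void format_then_put(FILE *shared_log, long tx_id,
                                double latency_us)
    {
        char line[128];                 /* stack buffer: no malloc */

        snprintf(line, sizeof(line), "%ld %.3f\n", tx_id, latency_us);
        fputs(line, shared_log);        /* lock held only for the copy */
    }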


--
Fabien.


Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-21 Thread Fabien COELHO


Hello Didier,


If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the
collision probability would be about 1/40 and the delayed thread would be
waiting for half this time on average, so the performance impact due to
fprintf locking would be negligible (a 1/80 delay occurring in 1/40 of
cases = 1/3200 time added to the computed average, if I'm not mistaken).



If threads run more or less the same code with the same timing, after
a while they will lockstep on synchronization primitives and your
collision probability will be very close to 1.


I'm not sure I understand. If transaction times were really constant, then 
after a while the mutexes would be synchronised so as to avoid contention, 
i.e. the collision probability would be 0?



Moreover, they will write to the same cache lines for every fprintf,
and this is very, very bad even without atomic operations.


We're talking of transactions that involve network messages and possibly 
disk IOs on the server, so some cache issues within pgbench would not be 
a priori the main performance driver.


--
Fabien.




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-21 Thread Fabien COELHO



Well, fprintf() doesn't have to acquire the lock for the entirety of
its operation - just for the access to the stream buffer.


Yep. If it is implemented by appending stuff to the stream as the format 
is processed, this would still mean holding the lock for the whole of its 
operation.

Hence preprocessing the string as you suggested may be desirable.


Note that posix 2001 *does* guarantee that FILE* style IO is thread
safe:
All functions that reference (FILE *) objects, except those with names
ending in _unlocked, shall behave as if they use flockfile() and
funlockfile() internally to obtain ownership of these (FILE *) objects.

Hilariously, that tidbit is hidden in the documentation about
flockfile. Very, err, easy to find:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/flockfile.html


Thanks for the pointer!

--
Fabien.




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-20 Thread Tomas Vondra
Hi,

On 20.3.2015 13:43, Fabien COELHO wrote:
 
 Hello Robert,
 
 The fprintf we are talking about occurs at most once per pgbench 
 transaction, possibly much less when aggregation is activated,
 and this transaction involves networks exchanges and possibly
 disk writes on the server.

 random() was occurring four times per transaction rather than
 once, but OTOH I think fprintf() is probably a much heavier-weight 
 operation.
 
 Yes, sure.
 
 My point is that if there are many threads and tremendous TPS, the 
 *detailed* per-transaction log (aka simple log) is probably a bad 
 choice anyway, and the aggregated version is the way to go.

I disagree with this reasoning. Can you provide numbers supporting it?

I do agree that fprintf is not cheap, actually when profiling pgbench
it's often the #1 item, but the impact on the measurements is actually
quite small. For example with a small database (scale 10) and read-only
30-second runs (single client), I get this:

   no logging: 18672 18792 18667 18518 18613 18547
 with logging: 18170 18093 18162 18273 18307 18234

So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference.
And with more expensive transactions (larger scale, writes, ...) the
difference will be much smaller.

It's true that this might produce large logs, especially when the runs
are long, but that has nothing to do with fprintf. And it can be easily
fixed by either using a dedicated client machine, or only sampling the
transaction log.

Introducing actual synchronization between the threads (by locking
inside fprintf) is however a completely different thing.

 Note that even without mutex fprintf may be considered a heavy
 function which is going to slow down the transaction rate
 significantly. That could be tested as well.
 
 It is possible to reduce the lock time by preparing the string
 (which would mean introducing buffers) and just doing an fputs under
 mutex. That would not reduce the print time anyway, and it may add
 malloc/free operations, though.

I seriously doubt fprintf does the string formatting while holding the lock
on the file. So by doing this you only simulate what fprintf() does
(assuming it's thread-safe on your platform) and gain nothing.

 
 The way to know if there's a real problem here is to test it, but 
 I'd be pretty surprised if there isn't.
 
 Indeed, I think I can contrive a simple example where it is,
 basically a more or less empty or read only transaction (eg SELECT
 1).

That would be nice, because my quick testing suggests it's not the case.

 My opinion is that there is a tradeoff between code simplicity and 
 later maintenance vs feature benefit.
 
 If threads are assumed and fprintf is used, the feature is much 
 simpler to implement, and the maintenance is lighter.

I think the "if threads are assumed" part makes this dead in the water
unless someone wants to spend time on getting rid of the thread
emulation. Removing the code is quite simple, researching whether we can
do that will be difficult IMHO - I have no idea which of the supported
platforms require the emulation etc. And I envision endless discussions
about this.

 The alternative implementation means reparsing the generated files 
 over and over for merging their contents.

I agree that the current implementation is not particularly pretty, and
I plan to get rid of the copy-paste parts etc.

 Also, I do not think that the detailed log provides much benefit
 with very fast transactions, where probably the aggregate is a much
 better choice anyway. If the user persists, she may generate a
 per-thread log and merge it later, in which case a merge script is
 needed, but I do not think that would be a bad thing.

I disagree with this - I use transaction logs (either complete or
sampled) quite often. I also explained why I think a separate merge
script is awkward to use.

-- 
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-20 Thread Fabien COELHO


Hello Robert,


The fprintf we are talking about occurs at most once per pgbench
transaction, possibly much less when aggregation is activated, and this
transaction involves networks exchanges and possibly disk writes on the
server.


random() was occurring four times per transaction rather than once,
but OTOH I think fprintf() is probably a much heavier-weight
operation.


Yes, sure.

My point is that if there are many threads and tremendous TPS, the 
*detailed* per-transaction log (aka simple log) is probably a bad choice 
anyway, and the aggregated version is the way to go.


Note that even without mutex fprintf may be considered a heavy function 
which is going to slow down the transaction rate significantly. That could 
be tested as well.


It is possible to reduce the lock time by preparing the string (which 
would mean introducing buffers) and just doing an fputs under mutex. That 
would not reduce the print time anyway, and it may add malloc/free 
operations, though.


The way to know if there's a real problem here is to test it, but I'd be 
pretty surprised if there isn't.


Indeed, I think I can contrive a simple example where it is, basically a
more or less empty or read only transaction (eg SELECT 1).

My opinion is that there is a tradeoff between code simplicity and later 
maintenance vs feature benefit.


If threads are assumed and fprintf is used, the feature is much simpler to 
implement, and the maintenance is lighter. The alternative implementation 
means reparsing the generated files over and over for merging their 
contents.


Also, I do not think that the detailed log provides much benefit with very 
fast transactions, where probably the aggregate is a much better choice 
anyway. If the user persists, she may generate a per-thread log and merge 
it later, in which case a merge script is needed, but I do not think that 
would be a bad thing.


Obviously, all that is only my opinion and is quite debatable.

--
Fabien.




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-17 Thread Robert Haas
On Sun, Mar 15, 2015 at 3:35 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 Firstly, the fact that pgbench produces one file per thread is awkward.

 I agree, but I think it is due to the multi-process thread emulation: if you
 have real threads, you can do a simple fprintf, possibly with some mutex,
 and you're done. There is really nothing to do to implement this
 feature.

I think that's probably not a good idea, because fprintf() might then
become a bottleneck.  I fixed a similar problem with random() in
commit 4af43ee3f165c8e4b332a7e680a44f4b7ba2d3c1.  Even though the
locking was happening inside libc, it was still locking, and it still
caused a contention problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-17 Thread Robert Haas
On Tue, Mar 17, 2015 at 11:27 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 The fprintf we are talking about occurs at most once per pgbench
 transaction, possibly much less when aggregation is activated, and this
 transaction involves networks exchanges and possibly disk writes on the
 server.

random() was occurring four times per transaction rather than once,
but OTOH I think fprintf() is probably a much heavier-weight
operation.  The way to know if there's a real problem here is to test
it, but I'd be pretty surprised if there isn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-17 Thread Fabien COELHO


Hello,


I agree, but I think it is due to the multi-process thread emulation: if you
have real threads, you can do a simple fprintf, possibly with some mutex,
and you're done. There is really nothing to do to implement this
feature.


I think that's probably not a good idea, because fprintf() might then
become a bottleneck.  I fixed a similar problem with random() in
commit 4af43ee3f165c8e4b332a7e680a44f4b7ba2d3c1.  Even though the
locking was happening inside libc, it was still locking, and it still
caused a contention problem.


The fprintf we are talking about occurs at most once per pgbench 
transaction, possibly much less when aggregation is activated, and this 
transaction involves networks exchanges and possibly disk writes on the 
server.


So I would have thought that the risk of contention because of such a lock 
would be very low in this case. If it really becomes a bottleneck, it 
means a lot of threads doing a lot of small transactions, and I would 
suggest that using the aggregated option would be the right answer to 
that.


--
Fabien.




Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-15 Thread Fabien COELHO

Hello Tomas,


attached is a patch implementing merging of pgbench logs. These logs are
written by each thread, so with N threads you get N files with names

   pgbench_log.PID
   pgbench_log.PID.1
   ...
   pgbench_log.PID.N

Before analyzing these logs, the files need to be combined. I usually
ended up writing ad-hoc scripts to do that, losing them, writing them
again, and so on, over and over.


I've looked at the patch. Although I think that such a feature is somewhat 
desirable... I have two issues with it: ISTM that


(1) it does not belong to pgbench as such

(2) even if, the implementation is not right

About (1):

I think that this functionality as implemented does not belong to 
pgbench as such, and really belongs to an external script, which happens 
not to be readily available, which is a shame. PostgreSQL should probably 
provide such a script, or make it easy to find.


It could belong to pgbench if pgbench was *generating* the merged log file 
directly. However I'm not sure this can be done cleanly for the 
multi-process thread emulation (IN PASSING: which I think we should 
really get rid of, because I do not know what system does not provide 
threads nowadays yet would require a state-of-the-art pgbench; this 
emulation significantly complicates the code by making things uselessly 
difficult and/or needing to be implemented twice or not provided in some 
cases).


Another issue raised by your patch is that the log format may be improved, 
say by providing a one-field timestamp at the beginning of the line.



About (2):

In the implementation you reimplement a partial merge sort by parsing each 
line of each file and merging it with the current result over and over. 
ISTM that an implementation should read all files in parallel and merge 
them in one pass. The current IO complexity is p²n where it should simply 
be pn... do not use it with a significant number of threads and many 
lines... Ok, the generated files are likely to be kept in cache, but 
nevertheless.
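A sketch of that one-pass p-way merge (assuming lines begin with a
timestamp, which is itself one of the format changes suggested below; a
linear scan over the p heads keeps it short, and a min-heap would make the
pick O(log p)):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        FILE  *fp;
        char   line[1024];
        double ts;          /* leading timestamp of the buffered line */
        int    eof;
    } LogHead;

    static void advance(LogHead *h)
    {
        if (fgets(h->line, sizeof(h->line), h->fp) == NULL)
            h->eof = 1;
        else
            h->ts = strtod(h->line, NULL);  /* timestamp-first lines */
    }

    static void merge_logs(LogHead *head, int p, FILE *out)
    {
        for (int i = 0; i < p; i++)
            advance(&head[i]);

        for (;;) {
            int best = -1;
            for (int i = 0; i < p; i++)
                if (!head[i].eof && (best < 0 || head[i].ts < head[best].ts))
                    best = i;
            if (best < 0)
                break;                      /* all inputs exhausted */
            fputs(head[best].line, out);    /* emit and refill that head */
            advance(&head[best]);
        }
    }

Each input file is read exactly once, so the IO is pn rather than p²n.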


Also there is plenty of copy-paste in scanning the two kinds of files and 
then reprinting in all the different formats. The same logic is 
implemented twice, once for simple and once for aggregated logs. This 
means that updating or extending the log format later on would require 
modifying these scans and prints in many places.


For aggregates, some data in the output may be special values 
(NaN/-/...); I am not sure how the implementation would behave in such 
cases. As lines that do not match are silently ignored, the resulting 
merge would just be non-significant; should it rather be an error? Try it 
with a low rate, for instance.


It seems that system function calls are not tested for errors.


The other disadvantage of the external scripts is that you have to pass
all the info about the logs (whether the logs are aggregated, whether
there's throttling, etc.).


I think that is another argument for a better format, with a timestamp 
first? Also, ISTM that it only needs to know whether it is merging 
aggregated or simple logs, no more; the other information can be inferred 
from the number of fields on the line.



So integrating this into pgbench directly seems like a better approach,
and the attached patch implements that.


You guessed that I disagree. Note that this is only my own opinion.

In summary, I think that:

(a) providing a clean script would be nice,

(b) if we could get rid of the thread emulation, pgbench could generate 
the merged logs directly and simply, and the option could be provided 
then.


--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-15 Thread Fabien COELHO



Firstly, the fact that pgbench produces one file per thread is awkward.


I agree, but I think it is due to the multi-process thread emulation: if 
you have real threads, you can do a simple fprintf, possibly with some 
mutex, and you're done. There is really nothing to do to implement this 
feature.


Secondly, a separate tool (even if provided with pgbench) would require
passing detailed info about the log format - whether it's aggregated or
not, throttling or not, ...


Hmmm.


It could belong to pgbench if pgbench was *generating* the merged
log file directly. [...]


I considered this approach, i.e. adding another 'collector' thread
receiving results from all the other threads, but decided not to do
that. That would require a fair amount of inter-process communication,
locking etc. and might affect the measurements


I agree that inter-process stuff should be avoided. This is not what I had 
in mind. I was thinking of fprintf on the same file handle by different 
threads.



Another issue raised by your patch is that the log format may be
improved, say by providing a one-field timestamp at the beginning of
the line.


I don't see how this is relevant to this patch?


For the simple log format, all the parsing needed would be for the 
timestamp, and the remainder would just be text to pass along; no need 
for %d %f... parsing of the rest.



The current I/O complexity is p²n where it should simply be pn...


Maybe. Implementing a 2-way merge sort was the simplest solution at the
moment, and I don't think this really matters because the I/O generated
by the benchmark itself is usually much higher than this.


If you do not do an n-way merge, you could do 2-way merges over a binary 
tree of inputs, so that the I/O complexity would be p.log(p).n (I think), and not p²n.



For aggregates, some data in the output may be special values
NaN/-/..., [...]

You mean the aggregated log? I can't think of a way to get such values 
there - can you provide a plausible scenario how that could happen?


Possibly I'm wrong. Ok, I tried it:

 sh> ./pgbench --rate=0.5 --aggregate-interval=1 --log
 sh> cat pgbench_log.12671
 1426445236 1 5034 25341156 5034 5034 687 471969 687 687
 1426445237 0 0 0 0 0 0 0 0 0
 1426445238 0 0 0 0 0 0 0 0 0
 1426445239 1 8708 75829264 8708 8708 2063 4255969 2063 2063
 ...

Good news, I could not generate strange values. Just when there are 0 
transactions in an interval, some care should be taken when combining values.
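
E.g. when folding one aggregated interval into another, a 
zero-transaction side must not contribute its (meaningless) min/max; a 
sketch of what I have in mind, with field names of my own:

  typedef struct
  {
      long    cnt;        /* number of transactions */
      double  lat_sum;    /* sum of latencies */
      double  lat_min;    /* smallest latency seen */
      double  lat_max;    /* largest latency seen */
  } Agg;

  static void
  agg_combine(Agg *dst, const Agg *src)
  {
      if (src->cnt == 0)
          return;                 /* nothing to fold in */
      if (dst->cnt == 0)
      {
          *dst = *src;            /* never compare against an empty min/max */
          return;
      }
      dst->cnt += src->cnt;
      dst->lat_sum += src->lat_sum;
      if (src->lat_min < dst->lat_min)
          dst->lat_min = src->lat_min;
      if (src->lat_max > dst->lat_max)
          dst->lat_max = src->lat_max;
  }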



It seems that system function calls are not tested for errors.


That's true, but that's how the rest of pgbench code is written.


Hmmm... This is not entirely true, there are *some* checks if you look 
carefully :-)


   if ((fd = fopen(filename, "r")) == NULL) ...
   if ((fp = popen(command, "r")) == NULL) ...
   nsocks = select(...)
   if (nsocks < 0) ...


So integrating this into pgbench directly seems like a better approach,
and the attached patch implements that.


You guessed that I disagree. Note that this is only my own opinion.

In summary, I think that:

(a) providing a clean script would be nice,

(b) if we could get rid of the thread emulation, pgbench could
generate the merged logs directly and simply, and the option could
be provided then.


That however is not the goal of this patch.


Sure. My point is that the amount of code you wrote to implement this 
merge stuff is due to this feature. Without it, the patch would probably 
need 10 lines of code. Moreover, the way it is implemented requires 
scanning and reprinting, which means more work in many places to update 
the format later.



The thread emulation is there for a reason,


My opinion is that it *was* there for a reason. Whether it makes sense 
today to still have it, maintain it, and have to write such heavy code for 
a trivial feature just because of it is another matter.



and I certainly am not going to work on eliminating it
(not sure that's even possible).


I wish it will be :-)

I would suggest this, which I would support: implement this feature the 
simple way (aka fprintf, maybe a mutex) when compiled with threads, and 
generate an error ("feature not available with process-based thread 
emulation") when compiled with processes. This way we avoid a lot of heavy 
code to maintain in the future, and you still get the feature within 
pgbench. There are already some things which are not the same with thread 
emulation because it would have been tiring to implement them for it for 
very little benefit.
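
In code, the whole feature could boil down to a sketch like this 
(assuming the PTHREAD_FORK_EMULATION define that pgbench sets for the 
fork-based fallback; the field list is illustrative, not the actual log 
format):

  #include <pthread.h>
  #include <stdio.h>

  static FILE *shared_log;

  #ifndef PTHREAD_FORK_EMULATION
  static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

  /* one shared file, one mutex, one fprintf per finished transaction */
  static void
  log_transaction(int client, int txn, double latency_us)
  {
      pthread_mutex_lock(&log_lock);
      fprintf(shared_log, "%d %d %.0f\n", client, txn, latency_us);
      pthread_mutex_unlock(&log_lock);
  }
  #else
  /* at option-parsing time, refuse the option outright, e.g.:
   *   fprintf(stderr, "merged log not available with fork emulation\n");
   *   exit(1);
   */
  #endif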


--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-15 Thread Tomas Vondra
On 15.3.2015 11:22, Fabien COELHO wrote:

 I've looked at the patch. Although I think that such a feature is
 somewhat desirable... I have two issues with it: ISTM that
 
 (1) it does not belong to pgbench as such
 
 (2) even if, the implementation is not right
 
 About (1):
 
 I think that this functionality as implemented does not belong to 
 pgbench, and really belongs to an external script, which happens 
 not to be readily available, which is a shame. PostgreSQL should 
 probably provide such a script, or make it easy to find.

I disagree, for two main reasons.

Firstly, the fact that pgbench produces one file per thread is awkward.
It was implemented like this most likely because it was the simplest
solution after adding -j in 9.0, but I don't remember if I ever needed
a separate log from a single thread.

Secondly, a separate tool (even if provided with pgbench) would require
passing detailed info about the log format - whether it's aggregated or
not, throttling or not, ...

Those are the two main reasons why I think this should be implemented as
another option in pgbench.


 It could belong to pgbench if pgbench was *generating* the merged
 log file directly. However I'm not sure this can be done cleanly
 with the multi-process thread emulation (IN PASSING: which I think
 should really be removed, because I do not know of a system nowadays
 that does not provide threads yet would require a state-of-the-art
 pgbench, and this emulation significantly complexifies the code by
 making things uselessly difficult and/or needing to be implemented
 twice or not provided in some cases).

I considered this approach, i.e. adding another 'collector' thread
receiving results from all the other threads, but decided not to do
that. That would require a fair amount of inter-process communication,
locking etc. and might affect the measurements (especially for the
workloads with extremely short transactions, like read-only pgbench).

The implemented approach, i.e. merging results collected by each thread
independently, after the benchmarking actually completed, is a better
solution. And the code is actually simpler.


 Another issue raised by your patch is that the log format may be 
 improved, say by providing a one-field timestamp at the beginning of 
 the line.

I don't see how this is relevant to this patch?


 About (2):
 
 In the implementation you reimplement a partial merge sort by
 parsing each line of each file and merging it with the current result
 over and over. ISTM that an implementation should read all files in
 parallel and merge them in one pass. The current I/O complexity is
 p²n (for p log files of n lines each) where it should simply be
 pn... do not use it with a significant number of threads and many
 lines. Ok, the generated files are likely to be kept in cache, but
 nevertheless.

Maybe. Implementing a 2-way merge sort was the simplest solution at the
moment, and I don't think this really matters because the I/O generated
by the benchmark itself is usually much higher than this.

The only case when this might make a difference is probably merging
large transaction logs (e.g. 100% sample from read-only test on small
dataset).

 Also there is plenty of copy-paste when scanning the two files and 
 then reprinting in all the different formats. The same logic is 
 implemented twice, once for simple and once for aggregated logs.
 This means that updating or extending the log format later would
 require modifying these scans and prints in many places.

That is true, and I'll address that somehow (either moving the code to a
macro or a separate utility function).

 For aggregates, some data in the output may be special values 
 (NaN/-/...); I am not sure how the implementation would behave in
 such cases. As lines that do not match are silently ignored, the
 resulting merge would just be non-significant; should it rather be
 an error? Try it with a low rate, for instance.

You mean the aggregated log? I can't think of a way to get such values
there - can you provide a plausible scenario how that could happen?

 
 It seems that system function calls are not tested for errors.

That's true, but that's how the rest of pgbench code is written.

 
 The other disadvantage of the external scripts is that you have to
 pass all the info about the logs (whether the logs are aggregated,
 whether there's throttling, etc.).
 
 I think that is another argument for a better format, with a
 timestamp first? Also, ISTM that it only needs to know whether it is
 merging aggregated or simple logs, no more; the other information can
 be inferred from the number of fields on the line.

No, it's not. Even if you change the format like this, you still have no
idea whether the log is a per-transaction log (possibly with some
additional options) or an aggregated log.

There might be some auto-detection based on the number of fields, for
example, but considering how many options influence that, I consider it
really unreliable.
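
Just for illustration, the probe itself would be trivial; the fragility 
is in interpreting the result, since several option combinations can 
produce the same count (sketch, not part of the patch):

  #include <ctype.h>

  /* count whitespace-separated fields on a log line */
  static int
  count_fields(const char *line)
  {
      int     n = 0;

      while (*line)
      {
          while (*line && isspace((unsigned char) *line))
              line++;                 /* skip separators */
          if (*line)
              n++;                    /* start of a field */
          while (*line && !isspace((unsigned char) *line))
              line++;                 /* skip the field body */
      }
      return n;
  }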

 
 So 

Re: [HACKERS] PATCH: pgbench - merging transaction logs

2015-03-15 Thread Tomas Vondra
On 15.3.2015 20:35, Fabien COELHO wrote:
 
 Firstly, the fact that pgbench produces one file per thread is awkward.
 
  I agree, but I think it is due to the multi-process thread emulation: if
  you have real threads, you can do a simple fprintf, possibly with some
  mutex, and you're done. There is really nothing to do to implement this
  feature.

My fear was that this either requires explicit locking, or some implicit
locking - for example, while fprintf() in glibc is thread-safe, I really
am not sure about Win32. This might influence the results for very short
transactions - I haven't tried that, though, so this might be a false
assumption.

The way I implemented the merge is immune to this.

  Secondly, a separate tool (even if provided with pgbench) would
  require passing detailed info about the log format - whether it's
  aggregated or not, throttling or not, ...
 
 Hmmm.
 
 It could belong to pgbench if pgbench was *generating* the merged
 log file directly. [...]

 I considered this approach, i.e. adding another 'collector' thread 
 receiving results from all the other threads, but decided not to
 do that. That would require a fair amount of inter-process
 communication, locking etc. and might affect the measurements
 
  I agree that inter-process stuff should be avoided. This is not what 
  I had in mind. I was thinking of fprintf on the same file handle 
  by different threads.

That still involves some sort of 'implicit' locking, no? And as I
mentioned, I'm not sure fprintf() is thread-safe on all the platforms we
support.
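
For what it's worth, on POSIX systems the locking can at least be made 
explicit around the whole line with flockfile() - not available as such 
on native Win32, so this is a sketch of the idea rather than a portable 
fix:

  #include <stdio.h>

  /* hold the stream's internal lock for the whole line, so concurrent
   * writers cannot interleave within it (POSIX flockfile) */
  static void
  log_line(FILE *log, long epoch, long usec, double latency)
  {
      flockfile(log);
      fprintf(log, "%ld.%06ld %f\n", epoch, usec, latency);
      funlockfile(log);
  }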


 Another issue raised by your patch is that the log format may be 
 improved, say by providing a one-field timestamp at the beginning
 of the line.

 I don't see how this is relevant to this patch?
 
  For the simple log format, all the parsing needed would be for the 
  timestamp, and the remainder would just be text to pass along; no
  need for %d %f... parsing of the rest.

Oh, ok. Well, that's true, but I don't think that significantly changes
the overall complexity.

 The current I/O complexity is p²n where it should simply be pn...

 Maybe. Implementing a 2-way merge sort was the simplest solution at
 the moment, and I don't think this really matters because the I/O
 generated by the benchmark itself is usually much higher than this.
 
  If you do not do an n-way merge, you could do 2-way merges over a
  binary tree of inputs, so that the I/O complexity would be
  p.log(p).n (I think), and not p²n.

Yes, I could do that.

I still think this is probably overengineering, but it's not a big deal.
I'll leave this for later, though (this patch is in the 2015-06 CF anyway).

 For aggregates, some data in the output may be special values
 NaN/-/..., [...]
  You mean the aggregated log? I can't think of a way to get such values
  there - can you provide a plausible scenario how that could happen?
 
 Possibly I'm wrong. Ok, I tried it:
 
   sh> ./pgbench --rate=0.5 --aggregate-interval=1 --log
   sh> cat pgbench_log.12671
  1426445236 1 5034 25341156 5034 5034 687 471969 687 687
  1426445237 0 0 0 0 0 0 0 0 0
  1426445238 0 0 0 0 0 0 0 0 0
  1426445239 1 8708 75829264 8708 8708 2063 4255969 2063 2063
  ...
 
  Good news, I could not generate strange values. Just when there are
  0 transactions in an interval, some care should be taken when
  combining values.

Yeah, I wasn't able to come up with such a scenario, but wasn't sure.

 
 It seems that system function calls are not tested for errors.

 That's true, but that's how the rest of pgbench code is written.
 
  Hmmm... This is not entirely true, there are *some* checks if you look
  carefully :-)
 
    if ((fd = fopen(filename, "r")) == NULL) ...
    if ((fp = popen(command, "r")) == NULL) ...
    nsocks = select(...)
    if (nsocks < 0) ...

OK, there are a few checks ;-) But none on the fprintf calls, IIRC.

Anyway, I plan to refactor this part of the patch to get rid of the
copy'n'paste pieces, so I'll take care of this too.

 (b) if we could get rid of the thread emulation, pgbench could
 generate the merged logs directly and simply, and the option could
 be provided then.

 That however is not the goal of this patch.
 
 Sure. My point is that the amount of code you wrote to implement this
 merge stuff is due to this feature. Without it, the patch would 
 probably need 10 lines of code. Moreover, the way it is implemented 
 requires scanning and reprinting, which means more work in many 
 places to update the format later.

Possibly. But it's not written like that :-(


 The thread emulation is there for a reason,
 
 My opinion is that it *was* there for a reason. Whether it makes 
 sense today to still have it, maintain it, and have to write such 
 heavy code for a trivial feature just because of it is another 
 matter.

Possibly. I can't really decide that.

 and I certainly am not going to work on eliminating it
 (not sure that's even possible).
 
 I wish it will be:-)
 
  I would suggest this, which I would support: implement this feature the
  simple way (aka fprintf, maybe a 

[HACKERS] PATCH: pgbench - merging transaction logs

2015-03-08 Thread Tomas Vondra
Hi there,

attached is a patch implementing merging of pgbench logs. These logs are
written by each thread, so with N threads you get N files with names

pgbench_log.PID
pgbench_log.PID.1
...
pgbench_log.PID.N

Before analyzing these logs, these files need to be combined. I usually
ended up writing ad-hoc scripts to do that, losing them, and writing them
again, over and over.

The other disadvantage of the external scripts is that you have to pass
all the info about the logs (whether the logs are aggregated, whether
there's throttling, etc.).

So integrating this into pgbench directly seems like a better approach,
and the attached patch implements that.

With '-m' or '--merge-logs' on the command-line, the logs are merged at
the end, using a simple 2-way merge to keep the log sorted by the time
field. It should work with all the other options that influence the log
format (--rate, --aggregate-interval and --latency-limit).

I'll add this to CF 2015-06.

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 706fdf5..d6ec87e 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -157,6 +157,11 @@ char	   *tablespace = NULL;
 char	   *index_tablespace = NULL;
 
 /*
+ * merge logs (transaction logs, aggregated logs) at the end
+ */
+bool		merge_logs = false;
+
+/*
  * end of configurable parameters
  */
 
@@ -367,6 +372,10 @@ static void *threadRun(void *arg);
 static void doLog(TState *thread, CState *st, FILE *logfile, instr_time *now,
 	  AggVals *agg, bool skipped);
 
+static void merge_log_files(int agg_interval, int nthreads);
+static void merge_aggregated_logs(FILE *infile_a, FILE *infile_b, FILE *outfile);
+static void merge_simple_logs(FILE *infile_a, FILE *infile_b, FILE *outfile);
+
 static void
 usage(void)
 {
@@ -408,6 +417,7 @@ usage(void)
 		 "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
 		 "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
 		 "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+		 "  -m, --merge-logs         merge logs produced by multiple threads\n"
 		   "\nCommon options:\n"
 		 "  -d, --debug              print debugging output\n"
 	"  -h, --host=HOSTNAME      database server host or socket directory\n"
@@ -2733,6 +2743,7 @@ main(int argc, char **argv)
		{"aggregate-interval", required_argument, NULL, 5},
		{"rate", required_argument, NULL, 'R'},
		{"latency-limit", required_argument, NULL, 'L'},
+		{"merge-logs", no_argument, NULL, 'm'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -2808,7 +2819,7 @@ main(int argc, char **argv)
 	state = (CState *) pg_malloc(sizeof(CState));
 	memset(state, 0, sizeof(CState));
 
-	while ((c = getopt_long(argc, argv, "ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:P:R:L:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:P:R:L:m", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -3017,6 +3028,10 @@ main(int argc, char **argv)
 	latency_limit = (int64) (limit_ms * 1000);
 }
 break;
+			case 'm':
+				printf("merge logs\n");
+				merge_logs = true;
+				break;
 			case 0:
 /* This covers long options which take no argument. */
 if (foreign_keys || unlogged_tables)
@@ -3137,6 +3152,12 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (merge_logs && (! use_log))
+	{
+		fprintf(stderr, "log merging is allowed only when actually logging transactions\n");
+		exit(1);
+	}
+
 	/*
 	 * is_latencies only works with multiple threads in thread-based
 	 * implementations, not fork-based ones, because it supposes that the
@@ -3418,6 +3439,10 @@ main(int argc, char **argv)
  throttle_lag, throttle_lag_max, throttle_latency_skipped,
  latency_late);
 
+	/* Merge logs, if needed */
+	if (merge_logs)
+		merge_log_files(agg_interval, nthreads);
+
 	return 0;
 }
 
@@ -3783,6 +3808,339 @@ done:
 	return result;
 }
 
+static void
+merge_log_files(int agg_interval, int nthreads)
+{
+	int i;
+
+	/* we can do this as 2-way merges (all the logs are sorted by timestamp) */
+	for (i = 1; i < nthreads; i++)
+	{
+		char	logpath[64];
+		char	logpath_new[64];
+
+		/* input and output files */
+		FILE   *infile_a, *infile_b;
+		FILE   *outfile;
+
+		/* the first input is always the 'main' logfile */
+		snprintf(logpath, sizeof(logpath), "pgbench_log.%d", main_pid);
+		infile_a = fopen(logpath, "r");
+
+		if (infile_a == NULL)
+		{
+			fprintf(stderr, "Couldn't open logfile \"%s\": %s", logpath, strerror(errno));
+			return;
+		}
+
+		snprintf(logpath, sizeof(logpath), "pgbench_log.%d.%d", main_pid, i);
+		infile_b = fopen(logpath, "r");
+
+		if (infile_b == NULL)
+		{
+			fprintf(stderr, "Couldn't open logfile \"%s\": %s", logpath, strerror(errno));
+			return;
+		}
+
+		snprintf(logpath, sizeof(logpath),