Hello Tomas,

> attached is a patch implementing merging of pgbench logs. These logs are
> written by each thread, so with N threads you get N files with names
>
>    pgbench_log.PID
>    pgbench_log.PID.1
>    ...
>    pgbench_log.PID.N
>
> Before analyzing these logs, the files need to be combined. I usually
> ended up writing ad-hoc scripts to do that, losing them, and writing
> them again, over and over.

I've looked at the patch. Although I think that such a feature is somewhat desirable, I have two issues with it: ISTM that

(1) it does not belong in pgbench as such

(2) even if it did, the implementation is not right

About (1):

I think that this functionality, as implemented, does not belong in pgbench; it really belongs in an external script, which happens not to be readily available, and that is a shame. PostgreSQL should probably provide such a script, or make one easy to find.

It could belong in pgbench if pgbench were *generating* the merged log file directly. However, I'm not sure this can be done cleanly with the multi-process "thread emulation" (IN PASSING: I think we should really get rid of that emulation, because I do not know of a system that lacks threads nowadays yet would still require a state-of-the-art pgbench; moreover, the emulation significantly complicates the code by making things uselessly difficult, needing to be implemented twice, or not being provided in some cases).

Another issue raised by your patch is that the log format could be improved, say by providing a one-field timestamp at the beginning of each line.
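
For instance (the field layout is from my memory of the documentation, and the values are made up), a simple per-transaction line currently carries its completion timestamp split over the time_epoch and time_us fields:

   0 521 1342 0 1426410121 123456

whereas a single leading timestamp, say

   1426410121.123456 0 521 1342 0

would be directly sortable and much easier to merge or post-process.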


About (2):

In the implementation you reimplement a partial merge sort by parsing each line of each file and merging it with the current result over and over. ISTM that an implementation should read all files in parallel and merge them in one pass. The current I/O complexity is O(p²n), where p is the number of log files and n the number of lines per file, when it should simply be O(pn)... do not use it with a significant number of threads and many lines. OK, the generated files are likely to stay in the OS cache, but nevertheless.
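
Something along these lines would do the one-pass merge (a rough Python sketch, not a proposal for the final tool; the sort-key field positions assume the simple, non-aggregated format, and that each per-thread file is already ordered by completion time):

   # merge_pgbench_logs.py -- one-pass k-way merge of per-thread logs.
   import glob
   import heapq
   import sys

   def keyed(f):
       # Yield (epoch, usec, line) so heapq.merge orders on the tuple
       # prefix; the original line is passed through unchanged.
       for line in f:
           fields = line.split()
           yield (int(fields[4]), int(fields[5]), line)

   files = [open(name) for name in sorted(glob.glob("pgbench_log.*"))]
   for _, _, line in heapq.merge(*(keyed(f) for f in files)):
       sys.stdout.write(line)
   for f in files:
       f.close()

heapq.merge keeps only one pending line per file, so memory stays constant and the total work is O(pn log p), close enough to the O(pn) above.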

Also, there is plenty of copy-paste when scanning the two files and then reprinting in all the different formats. The same logic is implemented twice, once for the simple format and once for the aggregated one. This means that updating or extending the log format later on would require modifying these scans and prints in many places.

For aggregates, some fields in the output may hold special values ("NaN", "-", ...), and I am not sure how the implementation would behave in such cases. As lines that do not match are silently ignored, the merged result would just be meaningless... should that rather be an error? Try it with a low rate, for instance.

It also seems that system function calls are not checked for errors.

> The other disadvantage of the external scripts is that you have to pass
> all the information about the logs (whether the logs are aggregated,
> whether there's throttling, etc.).

I think that is another argument for a better format, with a timestamp up front. Also, ISTM that the merging only needs to know whether it is dealing with aggregated or simple logs, no more; the other information can be inferred from the number of fields on the line.
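
To illustrate that inference (the baseline of six fields is an assumption and would need checking against the actual formats):

   def has_throttling_fields(line):
       # Assumed baseline: six fields for both the simple and the
       # aggregated line formats; any extra fields are taken to be
       # the throttling/lag columns.
       return len(line.split()) > 6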

> So integrating this into pgbench directly seems like a better approach,
> and the attached patch implements that.

You will have guessed that I disagree. Note that this is only my own opinion.

In summary, I think that:

(a) providing a clean script would be nice,

(b) if we could get rid of the "thread emulation", pgbench could generate the merged log directly and simply, and the option could then be provided.

--
Fabien.