Hello Tomas,

> attached is a patch implementing merging of pgbench logs. These logs are
> written by each thread, so with N threads you get N files with names
>
>    pgbench_log.PID
>    pgbench_log.PID.1
>    ...
>    pgbench_log.PID.N
>
> Before analyzing these logs, the files need to be combined. I usually
> ended up writing ad-hoc scripts to do that, losing them, and writing
> them again, over and over.

I've looked at the patch. Although I think that such a feature is somewhat desirable, I have two issues with it: ISTM that

(1) it does not belong in pgbench as such

(2) even if it did, the implementation is not right

About (1):

I think that this functionality, as implemented, does not belong in pgbench; it really belongs in an external script, which happens not to be readily available, and that is a shame. PostgreSQL should probably provide such a script, or make one easy to find.

It could belong in pgbench if pgbench were *generating* the merged log file directly. However, I'm not sure this can be done cleanly with the multi-process "thread emulation" (IN PASSING: I think we should really get rid of that emulation, because I do not know of a system that lacks threads nowadays yet would still require a state-of-the-art pgbench; moreover, the emulation significantly complicates the code by making things uselessly difficult, needing to be implemented twice, or not being provided in some cases).

Another issue raised by your patch is that the log format could be improved, say by providing a one-field timestamp at the beginning of each line.
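
For instance (the field layout is from my memory of the documentation, and the values are made up), a simple per-transaction line currently carries its completion timestamp split over the time_epoch and time_us fields:

   0 521 1342 0 1426410121 123456

whereas a single leading timestamp, say

   1426410121.123456 0 521 1342 0

would be directly sortable and much easier to merge or post-process.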


About (2):

In the implementation you reimplement a partial merge sort by parsing each line of each file and merging it with the current result over and over. ISTM that an implementation should read all files in parallel and merge them in one pass. The current I/O complexity is O(p²n), where p is the number of log files and n the number of lines per file, when it should simply be O(pn)... do not use it with a significant number of threads and many lines. OK, the generated files are likely to stay in the OS cache, but nevertheless.
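
Something along these lines would do the one-pass merge (a rough Python sketch, not a proposal for the final tool; the sort-key field positions assume the simple, non-aggregated format, and that each per-thread file is already ordered by completion time):

   # merge_pgbench_logs.py -- one-pass k-way merge of per-thread logs.
   import glob
   import heapq
   import sys

   def keyed(f):
       # Yield (epoch, usec, line) so heapq.merge orders on the tuple
       # prefix; the original line is passed through unchanged.
       for line in f:
           fields = line.split()
           yield (int(fields[4]), int(fields[5]), line)

   files = [open(name) for name in sorted(glob.glob("pgbench_log.*"))]
   for _, _, line in heapq.merge(*(keyed(f) for f in files)):
       sys.stdout.write(line)
   for f in files:
       f.close()

heapq.merge keeps only one pending line per file, so memory stays constant and the total work is O(pn log p), close enough to the O(pn) above.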

Also, there is plenty of copy-paste when scanning the two files and then reprinting in all the different formats. The same logic is implemented twice, once for the simple format and once for the aggregated one. This means that updating or extending the log format later on would require modifying these scans and prints in many places.

For aggregates, some fields in the output may hold special values ("NaN", "-", ...), and I am not sure how the implementation would behave in such cases. As lines that do not match are silently ignored, the merged result would just be meaningless... should that rather be an error? Try it with a low rate, for instance.

It also seems that system function calls are not checked for errors.

> The other disadvantage of the external scripts is that you have to pass
> all the information about the logs (whether the logs are aggregated,
> whether there's throttling, etc.).

I think that is another argument for a better format, with a timestamp up front. Also, ISTM that the merging only needs to know whether it is dealing with aggregated or simple logs, no more; the other information can be inferred from the number of fields on the line.
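
To illustrate that inference (the baseline of six fields is an assumption and would need checking against the actual formats):

   def has_throttling_fields(line):
       # Assumed baseline: six fields for both the simple and the
       # aggregated line formats; any extra fields are taken to be
       # the throttling/lag columns.
       return len(line.split()) > 6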

> So integrating this into pgbench directly seems like a better approach,
> and the attached patch implements that.

You will have guessed that I disagree. Note that this is only my own opinion.

In summary, I think that:

(a) providing a clean script would be nice,

(b) if we could get rid of the "thread emulation", pgbench could generate the merged log directly and simply, and the option could then be provided.

--
Fabien.