# About pgbench error handling v15

Patches apply cleanly. Compilation, global and local tests ok.

 - v15.1: refactoring is a definite improvement.
   Good, even if it is not very useful (see below).

   While restructuring, maybe predefined variables could be made read-only
   so that a script which tries to update them would fail, which would be a
   good thing. This is probably material for an independent patch.

 - v15.2: see detailed comments below

# Doc

Doc build is ok.

ISTM that the "number of tries" line would be better placed between the #threads and #transactions lines. What do you think?

Aggregate logging description: "{ failures | ... }" seems misleading because it suggests we have one or the other, whereas it can also be empty. I suggest: "{ | failures | ... }" to show the empty case.

Having a full example with retries in the doc is a good thing, and illustrates in passing that running with a number of clients on a small scale does not make much sense because of the contention on tellers/branches. I wonder whether the number of tries is set too high, though; ISTM that an application should give up before 100 tries? I like that the feature is also limited by the latency limit.

Minor editing:

"there're" -> "there are".

"the --time" -> "the --time option".

The overall English seems good, but I'm not a native speaker. As I already said, proofreading by a native speaker would be nice.

From a technical writing point of view, maybe the documentation could be improved a bit, but I'm not at ease on that subject. Some comments:

"The latency for failed transactions and commands is not computed separately." is unclear; please use a positive sentence to state what is true instead of what is not, so that the reader does not have to guess. Maybe: "The latency figures include failed transactions which have reached the maximum number of tries or the transaction latency limit.".

"The main report contains the number of failed transactions if it is non-zero." ISTM that this is a pain for scripts which would like to process these report data, because the field may or may not be there. I'm sure I will write such scripts, which explains my concern:-)

"If the total number of retried transactions is non-zero…" should it rather be "not one", because zero means unlimited retries?

The section describing the various types of errors that can occur is a good addition.

Option "--report-latencies" changed to "--report-per-commands": I'm fine with this change.

# FEATURES

--failures-detailed: I'm not convinced that this option should not always be on, but this is not very important, so let it be.

--verbose-errors: I still think this is only for debugging, but let it be.

Copying variables: ISTM that we should not need to save the variables' state… no clearing, no copying should be needed. The restarted transaction simply overwrites the existing variables, which is what the previous version was doing anyway. Scripts should write their own variables before using them, and if they don't then it is the user's problem. This is important for performance, because it means that after a client has executed all scripts once, the variable array is stable and does not incur significant maintenance costs. The only thing that needs saving for a retry is the pseudo-random generator state. This suggests simplifying or removing "RetryState".
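To illustrate, here is a minimal C sketch of what I have in mind; all the names here are my own guesses, not the patch's actual identifiers, and the generator struct is just a stand-in:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for pgbench's per-client random generator state. */
typedef struct RandomState { uint64_t s0, s1; } RandomState;

/* RetryState reduced to the only thing a retry must restore. */
typedef struct RetryState { RandomState random_state; } RetryState;

/* At transaction start: snapshot the generator, nothing else.
 * Variables are left alone; the retried script overwrites them. */
static void transactionStart(RetryState *retry, const RandomState *cs)
{
	retry->random_state = *cs;
}

/* On a serialization/deadlock failure: restore the snapshot so the
 * retried transaction draws the same random values. */
static void transactionRetry(const RetryState *retry, RandomState *cs)
{
	*cs = retry->random_state;
}
```

The point is that the struct copy is a couple of words, whereas saving and restoring the whole variable array is O(#variables) on every transaction.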

# CODE

The semantics of "cnt" is changed. Ok, the overall counters and their relationships make sense, and it simplifies the reporting code. Good.

In readCommandResponse: ISTM that PGRES_NONFATAL_ERROR is not needed and could be dealt with in the default case. We are only interested in serialization/deadlocks, which are fatal errors?
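As a sketch of what I mean, with made-up status names standing in for libpq's ExecStatusType values, the nonfatal case simply shares the default path:

```c
#include <assert.h>
#include <string.h>

typedef enum { ST_COMMAND_OK, ST_TUPLES_OK,
			   ST_NONFATAL_ERROR, ST_FATAL_ERROR } Status;

/* Only fatal errors can carry the serialization (40001) or deadlock
 * (40P01) SQLSTATEs we retry on, so nonfatal errors need no case of
 * their own and fall through to the default error handling. */
static const char *classifyResponse(Status st)
{
	switch (st)
	{
		case ST_COMMAND_OK:
		case ST_TUPLES_OK:
			return "ok";
		case ST_FATAL_ERROR:
			return "check-sqlstate";	/* maybe retryable */
		default:						/* includes ST_NONFATAL_ERROR */
			return "error";
	}
}
```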

doRetry: for consistency, given the assert, ISTM that it should return false if the duration has expired, by testing end_time or timer_exceeded.
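A sketch of the shape I have in mind (the parameter names are mine, not the patch's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Retry only if tries remain AND the benchmark duration has not
 * expired, matching the assertion at the top of the function. */
static bool doRetry(uint32_t tries, uint32_t max_tries,
					bool timer_exceeded, int64_t now, int64_t end_time)
{
	/* max_tries == 0 means unlimited retries */
	if (max_tries > 0 && tries >= max_tries)
		return false;
	/* under -T, stop retrying once the run duration is over */
	if (timer_exceeded || (end_time > 0 && now >= end_time))
		return false;
	return true;
}
```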

checkTransactionStatus: this function does several things at once with 2 booleans, which makes it not very clear to me. Maybe it would be clearer if it just returned an enum (in trans, not in trans, conn error, other error). Another reason to do that is that on a connection error pgbench could try to reconnect, which would be an interesting later extension, so let's pave the way for that. Also, I do not think that the function should print out a message; that should be the caller's decision.
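Something like this (all names illustrative) would both clarify the call sites and pave the way for reconnection; the numeric values mapped below are just this sketch's stand-ins for PQtransactionStatus() results:

```c
#include <assert.h>

/* Proposed return type for checkTransactionStatus(): one enum
 * instead of two booleans; the caller decides what to report. */
typedef enum TStatus
{
	TSTATUS_IDLE,			/* no transaction in progress */
	TSTATUS_IN_TRANS,		/* inside a transaction block */
	TSTATUS_CONN_ERROR,		/* connection is bad: room for a future
							 * automatic-reconnect extension */
	TSTATUS_OTHER_ERROR		/* unexpected backend state */
} TStatus;

/* Stub mapping of transaction-status codes (0=idle, 1=active,
 * 2=in-trans, 3=in-error, 4=unknown in this sketch). */
static TStatus checkTransactionStatus(int pq_status)
{
	switch (pq_status)
	{
		case 0: return TSTATUS_IDLE;
		case 2:
		case 3: return TSTATUS_IN_TRANS;
		case 4: return TSTATUS_CONN_ERROR;
		default: return TSTATUS_OTHER_ERROR;
	}
}
```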

verbose_errors: there is more or less repeated code under RETRY and FAILURE, which should be factored out into a separate function. The advanceConnectionState function is long enough. Once this is done, there is no need for a getLatencyUsed function.
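For instance, a single helper could build the message for both branches; this sketch formats into a buffer (names and wording are mine) so the caller keeps control of where it goes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef enum { ESTATUS_SERIALIZATION, ESTATUS_DEADLOCK } ErrorStatus;

/* One helper for the RETRY and FAILURE branches instead of the
 * near-duplicate inline code in advanceConnectionState(). */
static void formatVerboseError(char *buf, size_t len, int client_id,
							   ErrorStatus err, bool will_retry)
{
	snprintf(buf, len, "client %d %s after a %s failure",
			 client_id,
			 will_retry ? "repeats the transaction"
						: "ends the transaction",
			 err == ESTATUS_SERIALIZATION ? "serialization" : "deadlock");
}
```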

I'd put cleaning up the pipeline in a function. I do not understand why pipeline mode is not exited in all cases; the code checks the pipeline status twice within a few lines. I'd put this cleanup in the sync function as well, and report to the caller (advanceConnectionState) whether there was an error, to be managed there.
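The shape would be something like this sketch, with a stubbed pipeline status standing in for PQpipelineStatus() and the libpq calls reduced to comments:

```c
#include <assert.h>
#include <stdbool.h>

/* Stub for pipeline status values in this sketch. */
typedef enum { PIPE_OFF, PIPE_ON, PIPE_ABORTED } PipeStatus;

/* One place that always leaves pipeline mode during error cleanup
 * (instead of checking the status twice inline), and reports the
 * outcome so advanceConnectionState() can handle a failure. */
static bool exitPipelineIfAny(PipeStatus *status)
{
	if (*status == PIPE_OFF)
		return true;			/* nothing to do */
	/* real code: drain pending results, then PQexitPipelineMode() */
	*status = PIPE_OFF;
	return true;				/* false on a libpq error */
}
```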

WAIT_ROLLBACK_RESULT: consuming results in a while loop could be a function, to avoid code repetition (there and in the "error:" label in readCommandResponse). On the other hand, I'm not sure why the loop is needed: we can only get there by submitting a "ROLLBACK" command, so there should be only one result anyway?
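The factored-out helper would look roughly like this; the queue and getter here are stubs simulating libpq's convention of returning results until NULL:

```c
#include <assert.h>
#include <stddef.h>

/* Stub for PQgetResult(): returns entries until NULL, like libpq
 * does at the end of a command's results. */
static const char *fakeGetResult(const char **queue, int *pos)
{
	return queue[(*pos)++];
}

/* Factored-out helper: consume and discard every pending result
 * (PQclear() each one in the real code).  Returns how many were
 * seen; for a plain "ROLLBACK" we expect exactly one. */
static int discardResults(const char **queue, int *pos)
{
	int n = 0;
	while (fakeGetResult(queue, pos) != NULL)
		n++;
	return n;
}
```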

report_per_command: please always count retries and failures of commands, even if they will not be reported in the end; the code will be simpler and more efficient.

doLog: the format has changed, including a new string on failures which replaces the time field. Hmmm. Cannot say I like it much, but why not. ISTM that the string could be shortened to "deadlock" or "serialization". ISTM that the documentation example should include a line with a failure, to make it clear what to expect.

I'm okay with always computing thread stats.

# COMMENTS

struct StatsData comment is helpful.
 - "failed transactions" -> "unsuccessfully retried transactions"?
 - 'cnt' decomposition: the first term is field 'retried'? If so, say it
   explicitly.

"Complete the failed transaction" sounds strange: if it failed, it could not complete. I'd suggest "Record a failed transaction".

# TESTS

I suggested simplifying the tests by using conditionals & sequences. You reported that you got stuck. Hmmm.

I tried my tests again: they worked fine when started with 2 clients, but otherwise they get stuck, because the first client waits for the second one, which does not exist (the point is to generate deadlocks and other errors). Maybe this is your issue?

Could you try with:

  psql < deadlock_prep.sql
  pgbench -t 4 -c 2 -f deadlock.sql
  # note: each deadlock detection takes 1 second

  psql < deadlock_prep.sql
  pgbench -t 10 -c 2 -f serializable.sql
  # very quick 50% serialization errors

--
Fabien.
