# About pgbench error handling v15

Patches apply cleanly. Compilation, global and local tests ok.

 - v15.1: refactoring is a definite improvement.
   Good, even if it is not very useful (see below).

   While restructuring, maybe predefined variables could be made read-only
   so that a script which tries to update them would fail, which would be a
   good thing. This is probably material for an independent patch.

 - v15.2: see detailed comments below

# Doc

Doc build is ok.

ISTM that the "number of tries" line would be better placed between the #threads and #transactions lines. What do you think?

Aggregate logging description: "{ failures | ... }" seems misleading because it suggests we have one or the other, whereas it can also be empty. I suggest: "{ | failures | ... }" to show the empty case.

Having a full example with retries in the doc is a good thing, and illustrates in passing that running with a number of clients on a small scale does not make much sense because of the contention on tellers/branches. I wonder whether the number of tries is set too high, though; ISTM that an application should give up before 100 tries? I like that the feature is also limited by the latency limit.

Minor editing:

"there're" -> "there are".

"the --time" -> "the --time option".

The overall English seems good, but I'm not a native speaker. As I already said, proofreading by a native speaker would be nice.

From a technical writing point of view, maybe the documentation could be improved a bit, but I'm not at ease on that subject. Some comments:

"The latency for failed transactions and commands is not computed separately." is unclear; please use a positive sentence to state what is true instead of what is not, so that the reader does not have to guess. Maybe: "The latency figures include failed transactions which have reached the maximum number of tries or the transaction latency limit.".

"The main report contains the number of failed transactions if it is non-zero." ISTM that this is a pain for scripts which would like to process these report data, because the field may or may not be there. I'm sure I will write such scripts, which explains my concern:-)

"If the total number of retried transactions is non-zero…" should it rather be "not one", because zero means unlimited retries?

The section describing the various types of errors that can occur is a good addition.

Option "--report-latencies" changed to "--report-per-commands": I'm fine with this change.

# FEATURES

--failures-detailed: I'm not convinced that this option should not always be on, but this is not very important, so let it be.

--verbose-errors: I still think this is only for debugging, but let it be.

Copying variables: ISTM that we should not need to save the variables' state… no clearing, no copying should be needed. The restarted transaction simply overwrites the existing variables, which is what the previous version was doing anyway. Scripts should write their own variables before using them, and if they don't then it is the user's problem. This is important for performance, because it means that after a client has executed all scripts once, the variable array is stable and does not incur significant maintenance costs. The only thing that needs saving for a retry is the pseudo-random generator state. This suggests simplifying or removing "RetryState".
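To illustrate, here is a minimal C sketch of what I have in mind; all the names here are my own guesses, not the patch's actual identifiers, and the generator struct is just a stand-in:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for pgbench's per-client random generator state. */
typedef struct RandomState { uint64_t s0, s1; } RandomState;

/* RetryState reduced to the only thing a retry must restore. */
typedef struct RetryState { RandomState random_state; } RetryState;

/* At transaction start: snapshot the generator, nothing else.
 * Variables are left alone; the retried script overwrites them. */
static void transactionStart(RetryState *retry, const RandomState *cs)
{
	retry->random_state = *cs;
}

/* On a serialization/deadlock failure: restore the snapshot so the
 * retried transaction draws the same random values. */
static void transactionRetry(const RetryState *retry, RandomState *cs)
{
	*cs = retry->random_state;
}
```

The point is that the struct copy is a couple of words, whereas saving and restoring the whole variable array is O(#variables) on every transaction.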

# CODE

The semantics of "cnt" is changed. Ok, the overall counters and their relationships make sense, and it simplifies the reporting code. Good.

In readCommandResponse: ISTM that PGRES_NONFATAL_ERROR is not needed and could be dealt with in the default case. We are only interested in serialization/deadlocks, which are fatal errors?
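As a sketch of what I mean, with made-up status names standing in for libpq's ExecStatusType values, the nonfatal case simply shares the default path:

```c
#include <assert.h>
#include <string.h>

typedef enum { ST_COMMAND_OK, ST_TUPLES_OK,
			   ST_NONFATAL_ERROR, ST_FATAL_ERROR } Status;

/* Only fatal errors can carry the serialization (40001) or deadlock
 * (40P01) SQLSTATEs we retry on, so nonfatal errors need no case of
 * their own and fall through to the default error handling. */
static const char *classifyResponse(Status st)
{
	switch (st)
	{
		case ST_COMMAND_OK:
		case ST_TUPLES_OK:
			return "ok";
		case ST_FATAL_ERROR:
			return "check-sqlstate";	/* maybe retryable */
		default:						/* includes ST_NONFATAL_ERROR */
			return "error";
	}
}
```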

doRetry: for consistency, given the assert, ISTM that it should return false if the duration has expired, by testing end_time or timer_exceeded.
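A sketch of the shape I have in mind (the parameter names are mine, not the patch's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Retry only if tries remain AND the benchmark duration has not
 * expired, matching the assertion at the top of the function. */
static bool doRetry(uint32_t tries, uint32_t max_tries,
					bool timer_exceeded, int64_t now, int64_t end_time)
{
	/* max_tries == 0 means unlimited retries */
	if (max_tries > 0 && tries >= max_tries)
		return false;
	/* under -T, stop retrying once the run duration is over */
	if (timer_exceeded || (end_time > 0 && now >= end_time))
		return false;
	return true;
}
```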

checkTransactionStatus: this function does several things at once with 2 booleans, which makes it not very clear to me. Maybe it would be clearer if it just returned an enum (in trans, not in trans, conn error, other error). Another reason to do that is that on a connection error pgbench could try to reconnect, which would be an interesting later extension, so let's pave the way for that. Also, I do not think that the function should print out a message; that should be the caller's decision.
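Something like this (all names illustrative) would both clarify the call sites and pave the way for reconnection; the numeric values mapped below are just this sketch's stand-ins for PQtransactionStatus() results:

```c
#include <assert.h>

/* Proposed return type for checkTransactionStatus(): one enum
 * instead of two booleans; the caller decides what to report. */
typedef enum TStatus
{
	TSTATUS_IDLE,			/* no transaction in progress */
	TSTATUS_IN_TRANS,		/* inside a transaction block */
	TSTATUS_CONN_ERROR,		/* connection is bad: room for a future
							 * automatic-reconnect extension */
	TSTATUS_OTHER_ERROR		/* unexpected backend state */
} TStatus;

/* Stub mapping of transaction-status codes (0=idle, 1=active,
 * 2=in-trans, 3=in-error, 4=unknown in this sketch). */
static TStatus checkTransactionStatus(int pq_status)
{
	switch (pq_status)
	{
		case 0: return TSTATUS_IDLE;
		case 2:
		case 3: return TSTATUS_IN_TRANS;
		case 4: return TSTATUS_CONN_ERROR;
		default: return TSTATUS_OTHER_ERROR;
	}
}
```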

verbose_errors: there is more or less repeated code under RETRY and FAILURE, which should be factored out into a separate function. The advanceConnectionState function is long enough. Once this is done, there is no need for a getLatencyUsed function.
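For instance, a single helper could build the message for both branches; this sketch formats into a buffer (names and wording are mine) so the caller keeps control of where it goes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef enum { ESTATUS_SERIALIZATION, ESTATUS_DEADLOCK } ErrorStatus;

/* One helper for the RETRY and FAILURE branches instead of the
 * near-duplicate inline code in advanceConnectionState(). */
static void formatVerboseError(char *buf, size_t len, int client_id,
							   ErrorStatus err, bool will_retry)
{
	snprintf(buf, len, "client %d %s after a %s failure",
			 client_id,
			 will_retry ? "repeats the transaction"
						: "ends the transaction",
			 err == ESTATUS_SERIALIZATION ? "serialization" : "deadlock");
}
```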

I'd put cleaning up the pipeline in a function. I do not understand why pipeline mode is not exited in all cases; the code checks the pipeline status twice within a few lines. I'd put this cleanup in the sync function as well, and report to the caller (advanceConnectionState) whether there was an error, to be managed there.
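The shape would be something like this sketch, with a stubbed pipeline status standing in for PQpipelineStatus() and the libpq calls reduced to comments:

```c
#include <assert.h>
#include <stdbool.h>

/* Stub for pipeline status values in this sketch. */
typedef enum { PIPE_OFF, PIPE_ON, PIPE_ABORTED } PipeStatus;

/* One place that always leaves pipeline mode during error cleanup
 * (instead of checking the status twice inline), and reports the
 * outcome so advanceConnectionState() can handle a failure. */
static bool exitPipelineIfAny(PipeStatus *status)
{
	if (*status == PIPE_OFF)
		return true;			/* nothing to do */
	/* real code: drain pending results, then PQexitPipelineMode() */
	*status = PIPE_OFF;
	return true;				/* false on a libpq error */
}
```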

WAIT_ROLLBACK_RESULT: consuming results in a while loop could be a function, to avoid code repetition (there and in the "error:" label in readCommandResponse). On the other hand, I'm not sure why the loop is needed: we can only get there by submitting a "ROLLBACK" command, so there should be only one result anyway?
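The factored-out helper would look roughly like this; the queue and getter here are stubs simulating libpq's convention of returning results until NULL:

```c
#include <assert.h>
#include <stddef.h>

/* Stub for PQgetResult(): returns entries until NULL, like libpq
 * does at the end of a command's results. */
static const char *fakeGetResult(const char **queue, int *pos)
{
	return queue[(*pos)++];
}

/* Factored-out helper: consume and discard every pending result
 * (PQclear() each one in the real code).  Returns how many were
 * seen; for a plain "ROLLBACK" we expect exactly one. */
static int discardResults(const char **queue, int *pos)
{
	int n = 0;
	while (fakeGetResult(queue, pos) != NULL)
		n++;
	return n;
}
```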

report_per_command: please always count retries and failures of commands, even if they will not be reported in the end; the code will be simpler and more efficient.

doLog: the format has changed, including a new string on failures which replaces the time field. Hmmm. Cannot say I like it much, but why not. ISTM that the string could be shortened to "deadlock" or "serialization". ISTM that the documentation example should include a line with a failure, to make it clear what to expect.

I'm okay with always computing thread stats.

# COMMENTS

struct StatsData comment is helpful.
 - "failed transactions" -> "unsuccessfully retried transactions"?
 - 'cnt' decomposition: the first term is field 'retried'? If so, say it
   explicitly.

"Complete the failed transaction" sounds strange: if it failed, it could not complete. I'd suggest "Record a failed transaction".

# TESTS

I suggested simplifying the tests by using conditionals & sequences. You reported that you got stuck. Hmmm.

I tried my tests again: they worked fine when started with 2 clients, but otherwise they get stuck, because the first client waits for the second one, which does not exist (the point is to generate deadlocks and other errors). Maybe this is your issue?

Could you try with:

  psql < deadlock_prep.sql
  pgbench -t 4 -c 2 -f deadlock.sql
  # note: each deadlock detection takes 1 second

  psql < deadlock_prep.sql
  pgbench -t 10 -c 2 -f serializable.sql
  # very quick 50% serialization errors

--
Fabien.
