On 11-07-2018 16:24, Fabien COELHO wrote:
Hello Marina,

* -d/--debug: I'm not in favor in requiring a mandatory text argument on this option.

As you wrote in [1], adding an additional option is also a bad idea:

Hey, I'm entitled to some internal contradictions:-)

... and discussions will continue forever %-)

I'm sceptical of the "--debug-fails" option. ISTM that --debug is already there and should just be reused.

I was thinking that you could just use the existing --debug, not change its syntax. My point was that --debug exists, and you could just print the messages when under --debug.

Now I understand you better, thanks. I think it will be useful to receive only the messages about failures, because they and the progress reports can be lost among many other debug messages such as "client %d sending ..." / "client %d executing ..." / "client %d receiving".

Maybe it's better to use an optional argument/arguments for compatibility (--debug[=fails] or --debug[=NUM])? But if we use numbers, for now I can see only 2 levels, and there's no guarantee that they will not change..

Optional arguments to options (!) are not really clean things, so I'd like to avoid going down this path, esp. as I cannot see any other instance in pgbench or elsewhere in postgres,

AFAICS they are used in pg_waldump (option --stats[=record]) and in psql (option --help[=topic]).
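
For illustration, here is a minimal standalone sketch of how such options are wired up with getopt_long() and optional_argument (only the --debug[=fails] spelling comes from this thread; everything else is illustrative):

#include <getopt.h>
#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
    /*
     * optional_argument: "--debug" and "--debug=fails" both parse, but
     * "--debug fails" does NOT attach the argument -- one reason such
     * options are often considered unclean.
     */
    static const struct option long_options[] = {
        {"debug", optional_argument, NULL, 'd'},
        {NULL, 0, NULL, 0}
    };
    int         c;

    while ((c = getopt_long(argc, argv, "", long_options, NULL)) != -1)
    {
        if (c == 'd' && optarg == NULL)
            printf("plain --debug\n");
        else if (c == 'd' && strcmp(optarg, "fails") == 0)
            printf("--debug=fails\n");
    }
    return 0;
}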

and I personally consider these as a bad idea.

So if absolutely necessary, a new option is still better than changing the --debug syntax. If not necessary, then it is better:-)

Ok!

* I'm reserved about the whole ereport thing, see comments in other messages.

Thank you, I'll try to implement the error reporting in the way you suggested.

Dunno if it is a good idea either. The committer's word is the final one in the end:-)

I agree with you that ereport has good reasons to be non-trivial in the backend, and it does not have the same reasons in pgbench..

* doCustom changes.


On CSTATE_FAILURE, the next command is possibly started. Although there is some consistency with the previous point, I think that it totally breaks the state automaton, where now a command can start while the whole transaction is in a failing state anyway. There was no point in starting it in the first place.

So, for me, the FAILURE state should record/count the failure, then skip to RETRY if a retry is decided, else proceed to ABORT. Nothing else. It is much clearer that way.

Then RETRY should reinstate the global state and proceed to start the *first* command again.
<...>
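
For concreteness, a rough sketch of those transitions in doCustom(); the CSTATE_* names follow the patch and this discussion, while countFailure(), canRetry() and the retry_state field are hypothetical:

case CSTATE_FAILURE:
    /* just record/count the failure, nothing else is executed */
    countFailure(st);
    st->state = canRetry(st, &now) ? CSTATE_RETRY : CSTATE_ABORTED;
    break;

case CSTATE_RETRY:
    /* reinstate the global state saved at transaction start */
    st->retries++;
    st->random_state = st->retry_state.random_state;    /* hypothetical */
    st->command = 0;            /* restart from the *first* command */
    st->state = CSTATE_START_COMMAND;
    break;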

It is unclear to me why backslash command errors are turned to FAILURE instead of ABORTED: there is no way they are going to be retried, so maybe they should/could skip directly to ABORTED?

So do you propose to execute the command "ROLLBACK" without calculating its latency etc. if we are in a failed transaction, and to clear the conditional stack after each failure?

Also, just to be clear: do you want to have the state CSTATE_ABORTED for a client abort and another state for interrupting the current transaction?

I do not understand what "interrupting the current transaction" means. A transaction is either committed or rolled back, I do not know about "interrupted".

I mean that IIUC the server usually only reports the error, and you must manually send the command "END" or "ROLLBACK" to roll back a failed transaction.
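
For the record, libpq exposes this state directly, so a client can detect a failed transaction and roll it back; a minimal sketch, where executeStatement() stands in for whatever helper pgbench would actually use:

#include "libpq-fe.h"

static void
rollbackIfFailed(PGconn *con)
{
    /*
     * PQTRANS_INERROR: the transaction failed and the server ignores
     * further commands until ROLLBACK (or END) is sent.
     */
    if (PQtransactionStatus(con) == PQTRANS_INERROR)
        executeStatement(con, "ROLLBACK");
}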

When it is rolled back, probably some stats will be collected in passing; I'm fine with that.

If there is an error in a pgbench script, the transaction is aborted, which means for me that the script execution is stopped where it was, and either it is restarted from the beginning (retry) or counted as a failure (not retried, just aborted, really).

If by interrupted you mean that one script begins a transaction and another ends it, as I said in the review I think that this strange case should be forbidden, so that all the code and documentation trying to manage that can be removed.

Ok!

The current RETRY state does memory allocations to generate a message, with buffer allocation and so on. This looks like a costly and useless operation. If the user required "retries", then this is normal behavior: the retries are counted and will be printed out in the final report, and there is no point in printing out every single one of them. Maybe you want that for debugging, but then costly operations should be guarded.

I think we need these debugging messages because, for example,

Debugging messages should cost only when under debug. When not under debug, there should be no debugging messages, and there should be no cost for building and discarding such messages in the executed code path beyond testing whether the program is under debug.

if you use the option --latency-limit, we will never know in advance whether a serialization/deadlock failure will be retried or not.

ISTM that it will be shown in the final report. If I want debug, I ask for --debug; otherwise I think that the command should do what it was asked for, i.e. run scripts, collect performance statistics and show them at the end.

In particular, when running with retries enabled, the user is expecting deadlock/serialization errors, so they are not "errors" as such for them.

They also help to understand which retry limit was violated, or how close we were to these limits, during the execution of a specific transaction. But I agree with you that they are costly and can be skipped if the failure type is never retried. Maybe it is better to split them into multiple error function calls?..
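
For illustration, the retry decision under discussion could look roughly like this (max_tries, latency_limit and getLatencyUsed() are taken from this thread; the exact semantics here are my assumptions):

static bool
canRetry(CState *st, instr_time *now)
{
    /* assuming --max-tries, where 0 means unlimited tries */
    if (max_tries && st->retries + 1 >= max_tries)
        return false;
    /* assuming --latency-limit: no point retrying past the deadline */
    if (latency_limit && getLatencyUsed(st, now) >= 100.0)
        return false;
    return true;
}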

Debugging message costs should only be incurred when under --debug, not otherwise.

Ok! IIUC instead of this part of the code

initPQExpBuffer(&errmsg_buf);
printfPQExpBuffer(&errmsg_buf,
                  "client %d repeats the failed transaction (try %d",
                  st->id, st->retries + 1);
if (max_tries)
    appendPQExpBuffer(&errmsg_buf, "/%d", max_tries);
if (latency_limit)
{
    appendPQExpBuffer(&errmsg_buf,
                      ", %.3f%% of the maximum time of tries was used",
                      getLatencyUsed(st, &now));
}
appendPQExpBufferStr(&errmsg_buf, ")\n");
pgbench_error(DEBUG_FAIL, "%s", errmsg_buf.data);
termPQExpBuffer(&errmsg_buf);

can we try something like this?

PGBENCH_ERROR_START(DEBUG_FAIL)
{
    PGBENCH_ERROR("client %d repeats the failed transaction (try %d",
                  st->id, st->retries + 1);
    if (max_tries)
        PGBENCH_ERROR("/%d", max_tries);
    if (latency_limit)
    {
        PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used",
                      getLatencyUsed(st, &now));
    }
    PGBENCH_ERROR(")\n");
}
PGBENCH_ERROR_END();
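
To be clear about the intent: PGBENCH_ERROR_START would have to hide the whole block behind a level check, so that nothing between START and END is evaluated when the message is filtered out. Possible definitions, purely as a sketch (debug_level and the pgbench_error() signature used here are assumptions):

/*
 * Sketch only: everything between START and END is skipped entirely
 * when the message level is below the configured threshold, so no
 * formatting work is done in the non-debug path.
 */
#define PGBENCH_ERROR_START(elevel)                     \
    do {                                                \
        int         errmsg_level = (elevel);           \
        PQExpBufferData errmsg_buf;                     \
        if (errmsg_level >= debug_level)                \
        {                                               \
            initPQExpBuffer(&errmsg_buf);

#define PGBENCH_ERROR(...)                              \
            appendPQExpBuffer(&errmsg_buf, __VA_ARGS__)

#define PGBENCH_ERROR_END()                             \
            pgbench_error(errmsg_level, "%s", errmsg_buf.data); \
            termPQExpBuffer(&errmsg_buf);               \
        }                                               \
    } while (0)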

You have added 20-column alignment to the prints. This looks like too much and generates much too wide lines. Probably 10 (billion) would be enough.

I have already asked you about this in [2]:

Probably:-)

The variables for the numbers of failures and retries are of type int64, since the variable for the total number of transactions has the same type. That's why such a large alignment (as I understand it now, 20 characters are enough). Do you prefer floating alignments, depending on the maximum number of failures/retries for any command in any script?

An int64 counter is not likely to reach its limit anytime soon:-) If the column display limit is ever reached, ISTM that the text is just misaligned, which is a minor and rare inconvenience. If very wide columns are used, then the report does not fit my terminal and the text will always be wrapped around, which makes it harder to read, every time.

Ok!
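
For concreteness, the change is just the printf field width; 10 characters already cover any count below ten billion. A sketch (PRId64 is used here for brevity, whereas pgbench itself goes through the Postgres INT64_FORMAT machinery):

#include <inttypes.h>
#include <stdio.h>

void
print_counter(int64_t failures)
{
    printf("%20" PRId64 "\n", failures);    /* too wide for a terminal */
    printf("%10" PRId64 "\n", failures);    /* enough below 10 billion */
}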

The latency limit test at 900 ms is a bad idea because it takes a lot of time. I did such tests before and they were removed by Tom Lane because of determinism and timing issues. I would comment this test out for now.

Ok! If it doesn't bother you, can you tell me more about the causes of these determinism issues?.. Tests for some other failures that cannot be retried have already been added to 001_pgbench_with_server.pl.

Some buildfarm animals are very slow, so you cannot really assume much about time one way or another.

Thanks!

I do not understand why there is so much text about the "in failed SQL transaction" stuff, while we are mainly interested in serialization & deadlock errors, and this only falls into some "other" category. There seem to be more details about other errors than about deadlock & serialization errors.

The reporting should focus on what is of interest, either all errors, or some detailed split of these errors.

<...>

* "errors_in_failed_tx" is some subcounter of "errors", for a special
case. Why it is there fails me [I finally understood, and I think it
should be removed, see end of review]. If we wanted to distinguish,
then we should distinguish homogeneously: maybe just count the
different error types, eg have things like "deadlock_errors",
"serializable_errors", "other_errors", "internal_pgbench_errors" which would be orthogonal one to the other, and "errors" could be recomputed
from these.

Thank you, I agree with you. Unfortunately each new error type adds 1 or 2 new columns of maximum width 20 to the per-statement report.

The fact that some data are collected does not mean that they should all be reported in detail. We can have detailed error counts and report the sum of these errors, for instance, or have some more verbose/detailed reports as options (eg --latencies does just that).

Ok!
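
As a sketch of that orthogonal split (all names here are illustrative, not from the patch):

#include <stdint.h>

typedef struct ErrorCounts
{
    int64_t     deadlock_errors;        /* deadlock failures */
    int64_t     serialization_errors;   /* serialization failures */
    int64_t     other_errors;           /* everything else */
} ErrorCounts;

/* "errors" is recomputed, so it can never go out of sync */
static int64_t
total_errors(const ErrorCounts *ec)
{
    return ec->deadlock_errors + ec->serialization_errors +
           ec->other_errors;
}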

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
