Re: [HACKERS] Restore-reliability mode

Noah Misch Sat, 06 Jun 2015 12:59:00 -0700

On Fri, Jun 05, 2015 at 08:25:34AM +0100, Simon Riggs wrote:
> This whole idea of "feature development" vs reliability is bogus. It
> implies people that work on features don't care about reliability. Given
> the fact that many of the features are actually about increasing database
> reliability in the event of crashes and corruptions it just makes no sense.


I'm contrasting work that helps to keep our existing promises ("reliability")
with work that makes new promises ("features").  In software development, we
invariably hazard old promises to make new promises; our success hinges on
electing neither too little nor too much risk.  Two years ago, PostgreSQL's
track record had placed it in a good position to invest in new, high-risk,
high-reward promises.  We did that, and we emerged solvent yet carrying an
elevated debt service ratio.  It's time to reduce risk somewhat.

You write about a different sense of "reliability."  (Had I anticipated this
misunderstanding, I might have written "Restore-probity mode.")  None of this
was about classifying people, most of whom allocate substantial time to each
kind of work.

> How will we participate in cleanup efforts? How do we know when something
> has been "cleaned up", how will we measure our success or failure? I think
> we should be clear that wasting N months on cleanup can *fail* to achieve a
> useful objective. Without a clear plan it almost certainly will do so. The
> flip side is that wasting N months will cause great amusement and dancing
> amongst those people who wish to pull ahead of our open source project and
> we should take care not to hand them a victory from an overreaction.

I agree with all that.  We should likewise take care not to become insolvent
from an underreaction.

> So lets do our normal things, not do a "total stop" for an indefinite
> period. If someone has specific things that in their opinion need to be
> addressed, list them and we can talk about doing them, together.

I recommend these four exit criteria:

1. Non-author committer review of foreign key locks/multixact durability.
   Done when that committer certifies, as if he were committing the patch
   himself today, that the code will not eat data.

2. Non-author committer review of row-level security.  Done when that
   committer certifies that the code keeps its promises and that the
   documentation bounds those promises accurately.

3. Second committer review of the src/backend/access changes for INSERT ... ON
   CONFLICT DO NOTHING/UPDATE.  (Bugs affecting folks who don't use the new
   syntax are most likely to fall in that portion.)  Unlike the previous two
   criteria, a review without certification is sufficient.

4. Non-author committer certifying that the 9.5 WAL format changes will not
   eat your data.  The patch lists Andres and Alvaro as reviewers; if they
   already reviewed it enough to make that certification, this one is easy.

That ties up four people.  For everyone else:

- Fix bugs those reviews find.  This will start slow but will grow to keep
  everyone busy.  Committers won't certify code, and thus we can't declare
  victory, until these bugs are fixed.  The rest of this list, in contrast,
  calls out topics to sample from, not topics to exhaust.

- Turn current buildfarm members green.

- Write, review and commit more automated test machinery to PostgreSQL.  Test
  whatever excites you.  If you need ideas, Craig posted some good ones
  upthread.  Here are a few more:
  - Add a debug mode that calls sched_yield() in SpinLockRelease(); see
    6322.1406219...@sss.pgh.pa.us.
  - Improve TAP suite (src/test/perl/TestLib.pm) logging.  Currently, these
    suites redirect much output to /dev/null.  Instead, log that output and
    teach the buildfarm to capture the log.
  - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local pin
    count falls to zero.  Under CLOBBER_FREED_MEMORY, wipe a shared buffer
    when its global pin count falls to zero.
  - With assertions enabled, or perhaps in a new debug mode, have
    pg_do_encoding_conversion() and pg_server_to_any() check the data for a
    no-op conversion instead of assuming the data is valid.

- Add buildfarm members.  This entails reporting any bugs that prevent an
  initial passing run.  Once you have a passing run, schedule regular runs.
  Examples of useful additions:
  - "./configure ac_cv_func_getopt_long=no ac_cv_func_snprintf=no ..." to
    enable all the replacement code regardless of the current platform's need
    for it.  This helps distinguish "Windows bug" from "replacement code bug."
  - --disable-integer-datetimes, --disable-float8-byval,
    --disable-float4-byval, --disable-spinlocks, --disable-atomics,
    --disable-thread-safety, --disable-largefile,
    #define RANDOMIZE_ALLOCATED_MEMORY
  - Any OS or CPU architecture other than x86 GNU/Linux, even ones already
    represented.

- Write, review and commit fixes for the bugs that come to light by way of
  these new automated tests.

- Anything else targeted to make PostgreSQL keep the promises it has already
  made to our users.
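
To illustrate the sched_yield() item above, here is a minimal sketch.  It is
not PostgreSQL's spinlock code: it is a toy C11 spinlock whose release path
yields the CPU when a hypothetical DEMO_SPINLOCK_YIELD macro is set.  Every
demo_* name and the macro itself are assumptions made for illustration.  The
intent is that code which wrongly keeps touching shared state after releasing
the lock becomes far more likely to lose the race during testing.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical debug toggle for this sketch; not a real PostgreSQL symbol. */
#ifndef DEMO_SPINLOCK_YIELD
#define DEMO_SPINLOCK_YIELD 1
#endif

/* Toy spinlock built on a C11 atomic flag. */
typedef struct
{
    atomic_flag locked;
} demo_spinlock;

/* Put the lock into the released state. */
static void
demo_spin_init(demo_spinlock *lock)
{
    assert(lock != NULL);
    atomic_flag_clear(&lock->locked);
}

/* Try once; return true if this call took the lock. */
static bool
demo_spin_try_acquire(demo_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->locked,
                                              memory_order_acquire);
}

/* Busy-wait until the lock is taken. */
static void
demo_spin_acquire(demo_spinlock *lock)
{
    while (!demo_spin_try_acquire(lock))
        ;
}

/*
 * Release the lock.  In the debug configuration, yield immediately so that
 * another waiter is likely to run before the releasing thread continues.
 */
static void
demo_spin_release(demo_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
#if DEMO_SPINLOCK_YIELD
    sched_yield();
#endif
}
```

Building with -DDEMO_SPINLOCK_YIELD=0 compiles the yield out, so the normal
release path is unchanged outside the debug mode.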


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
