Re: [Maria-developers] e92037989f7: MDEV-21117: refine the server binlog-based recovery for semisync

2021-06-01 Thread Sergei Golubchik
Hi, Andrei!

On Jun 01, Andrei Elkin wrote:
>  
>  It's actually turns out not to be easy. A sequence of execution events
>  
>    E1.prepare(trx), E2.prepare(trx), E1.commit(trx), *crash*
>  
>  does not make E1 such as Innodb committed persistently in file.
>  At recovery in E1 trx may be found in prepared state, unless
>  before the crash I'd do something like Binlog CheckPoint (BCP)
>  notification request and wait of it.
> >>>
> The committed state is done asynchronously (by virtue of MDEV-232 that
> implements a part of BCP).

Oh. Indeed. Sorry, I've missed that.

Still a prepare of another transaction can force this commit to disk, as
far as I understand. So what about this:

  you stop the thread on a debug sync after the first commit

  then another connection starts committing in InnoDB, it stops on a
  debug sync after the prepare - before anything is written into binlog

  then you crash.

looks like it'll create the case you need

> >> No doubt it's technically possible to wait for the flushing and
> >> crash, but still does not look easy.
> >> Consider if we defer this sort of mtr test to MDEV-18959?
> 
> Sujatha just contributed `74a13b4de2` with an mtr test that crashes
> after the asynchronous fsync is done to the partially (2 engine case)
> committed state.
...
> Aslo now that `74a13b4de2` provides with a method to crash after the
> "commmit" fsync we might also employ it for RQG.

No, this is very different. This is a crash happening where Sujatha
expects it to happen, there's no value in that now, it's needed for
regression testing, to make sure the recovery won't be broken in the
future.

But it makes no sense to keep testing now how recovery works in case of
a crash happening at exactly the place where Sujatha wanted it to happen
and after she verified that crash at that exactly line does not break
the recovery. The whole point of RQG tests is to try crashes where you
did not expect them to happen. Adding DBUG_EXECUTE_IF() doesn't help
this goal at all.

Regards,
Sergei
VP of MariaDB Server Engineering
and secur...@mariadb.org

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] e92037989f7: MDEV-21117: refine the server binlog-based recovery for semisync

2021-05-31 Thread Sergei Golubchik
Hi, Andrei!

On May 27, Andrei Elkin wrote:
>>> 
>>> I had a rather sobering followup, slipped communicating to you
>>> though, sorry.
>>> 
>>> It's actually turns out not to be easy. A sequence of execution events
>>> 
>>>   E1.prepare(trx), E2.prepare(trx), E1.commit(trx), *crash*
>>> 
>>> does not make E1 such as Innodb committed persistently in file.  At
>>> recovery in E1 trx may be found in prepared state, unless before the
>>> crash I'd do something like Binlog CheckPoint (BCP) notification request
>>> and wait of it.
>>
>> I don't understand, what a binlog checkpoint has to do with it?
> 
> So E1=Innodb.
> In the above case in order to create partially committed trx
>   Innodb.commit(trx)
> must be followed with flushing trx to disk, which is typically
> asynchronous.
> 
> I mentioned BCP 'cos it ensures a trx (actually the last in a binlog
> file) is flushed.

With --innodb-flush-log-at-trx-commit=1 it should be perfectly
synnchronous, should it not?

> >> In this just a selected engine would be requested, unlike in BCP
> >> event case that is generated when both (in this case) have
> >> persistently committed. (That is we even can't involve BCP event to
> >> wait for, and the involvement would require some simulation as
> >> well).
> >> 
> >> So I propose to keep the current method leaving the "honest" crash
> >> to RQG that Alice has been training the patch with.
> >
> > Alice was loading the server with transactions and doing `kill -9`
> > at some point. Many interesting cases that we discuss (for example,
> > the one above) happen in a very small time window, so I don't know
> > whether random killing could provide an adequate test coverage.
> 
> No doubt it's technically possible to wait for the flushing and crash,
> but still does not look easy.
> Consider if we defer this sort of mtr test to MDEV-18959?

But we've spent numerous hours and emails discussing and you've spend
countless days and weeks implementing various tricky corner cases are
are and won't be ever tested.

Here's an idea of how to test it.
We're only interested in crashes during commit. So
* when starting a commit - atomically increment some in_commit counter
* on leaving the commit - decrement it
* on signal (some unused one) - crash the server if in_commit > 0
* after the crash - get the stack traces from the core to see where in the 
commit did it crash.

I've attached a small patch to illustrate this idea.

Hopefully, it'll cover everything, but if not, then after analyzing
stack traces we'll see where it did not crash, and move atomic
increment/decrements to be around this area only.

> >> >> diff --git a/sql/handler.h b/sql/handler.h
> >> >> --- a/sql/handler.h
> >> >> +++ b/sql/handler.h
> >> >> @@ -873,6 +874,15 @@ typedef struct xid_t XID;
> >> >>  /* The 'buf' has to have space for at least SQL_XIDSIZE bytes. */
> >> >>  uint get_sql_xid(XID *xid, char *buf);
> >> >>  
> >> >> +/* struct for semisync slave binlog truncate recovery */
> >> >> +struct xid_recovery_member
> >> >> +{
> >> >> +  my_xid xid;
> >> >> +  uint in_engine_prepare;  // number of engines that have xid prepared
> >> >> +  bool decided_to_commit;
> >> >> +  std::pair binlog_coord; // semisync recovery binlog 
> >> >> offset
> >> >
> >> > wouldn't it be clearer to have a struct with named members?
> >> 
> >> I am not sure if there's a need for naming. But we do need
> >> comparison that std::pair provides for granted.
> >> 
> >> I could consider (a) new typedef(s) to represent the binlog id and
> >> offset in the file to make it more readable.
> >> 
> >> You say?
> >> 
> >> > in fact, I'm somewhat surprised there's no such struct for binlog coords
> >> > already.
> >> 
> >> I had started that back at HB implementation actually, but that struct
> >> did not attact much of attention.
> >
> 
> > I'd still suggest a structure, like
> >
> > struct binlog_coords_t {
> >   uint binlog_num;
> >   my_off_t offset;
> > };
> 
> Well, with this, like I said about std:pair, we'd also have to have
>   bool operator < (const binlog_coords_t)

like

  bool operator <(const binlog_coords_t c)
  { return binlog_num < c.binlog_num || binlog_num == c.binlog_num && offset < 
c.offset; }

> I seriously rate refuse to reuse as a deadly sin of programming.
> (Unlike normal sins it may be soften by realization though :-)).
> 
> Let's waive it away, right :-)?
> 
> Hoping on that `72a59e4b068` already defined the following:
...
> +/*
> +  Compound binlog-id and byte offset of transaction's first event
> +  in a sequence (e.g the recovery sequence) of binlog files.
> +  Binlog_offset(0,0) is the minimum value to mean
> +  the first byte of the first binlog file.
> +*/
> +typedef std::pair Binlog_offset;

This is better than writing std::pair everywhere.
Not as clear as a binlog_coords_t from above, but _perhaps_
it'll do, let's see how it'll be used.

Regards,
Sergei
VP of MariaDB Server Engineering
and secur...@mariadb.org
diff --git a/sql/handler.cc b/sql/handler.cc
--- 

Re: [Maria-developers] e92037989f7: MDEV-21117: refine the server binlog-based recovery for semisync

2021-05-26 Thread Sergei Golubchik
Hi, Andrei!

On Apr 27, Andrei Elkin wrote:
> >> +--let $case = B: one engine has committed its transaction branch
> >> +# Hold off after one engine has committed.
> >> +--let $shutdown_timeout=0
> >> +--let $debug_sync_action = "commit_after_run_commit_ordered SIGNAL 
> >> con1_ready WAIT_FOR signal_no_signal"
> >> +# Both debug_sync and debug-dbug are required to make sure Engines 
> >> remember the commit state
> >> +# debug_sync alone will not help.
> >> +--let $restart_parameters = --rpl-semi-sync-slave-enabled=1 
> >> --debug-dbug=d,binlog_truncate_partial_commit
> >
> > in the first review I wrote
> >
> >   this seems to be a rather crude way of faking a partially committed
> >   transaction. better to crash after the first engine has committed,
> >   that'd be much more natural.
> >
> > and you replied
> >
> >   This simulation aimed at (allows for) more complicated recovery time
> >   event sequences.
> >   In this case, indeed, crashing by demand is about of the same efforts.
> >   I can convert to that.
> >
> >   [x]
> 
> I had a rather sobering followup, slipped communicating to you though, sorry.
> 
> It's actually turns out not to be easy. A sequence of execution events
> 
>   E1.prepare(trx), E2.prepare(trx), E1.commit(trx), *crash*
> 
> does not make E1 such as Innodb committed persistently in file.  At
> recovery in E1 trx may be found in prepared state, unless before the
> crash I'd do something like Binlog CheckPoint (BCP) notification request
> and wait of it.

I don't understand, what a binlog checkpoint has to do with it?

> In this just a selected engine would be requested, unlike in BCP event
> case that is generated when both (in this case) have persistently committed.
> (That is we even can't involve BCP event to wait for, and the involvement
> would require some simulation as well).
> 
> So I propose to keep the current method leaving the "honest" crash to
> RQG that Alice has been training the patch with.

Alice was loading the server with transactions and doing `kill -9` at
some point. Many interesting cases that we discuss (for example, the one
above) happen in a very small time window, so I don't know whether
random killing could provide an adequate test coverage.

> >> diff --git a/mysql-test/suite/rpl/t/rpl_semi_sync_fail_over.test 
> >> b/mysql-test/suite/rpl/t/rpl_semi_sync_fail_over.test
> >> --- /dev/null
> >> +++ b/mysql-test/suite/rpl/t/rpl_semi_sync_fail_over.test
> >> @@ -0,0 +1,143 @@
> >> +
> >> +--source include/have_innodb.inc
> >> +--source include/have_debug_sync.inc
> >> +--source include/have_binlog_format_row.inc
> >> +--let $rpl_topology=1->2
> >> +--source include/rpl_init.inc
> >
> > why not to source master-slave.inc if you're using a standard master-slave
> > topology anyway?
> 
> Using numbers it's easier to manipulate with failover operatations.
> Besides it's truly uncomfortable both read and write
> when `master` would be serving the slave role as it happens in the test :-).

I believe I already replied to that earlier.
master-slave.inc creates server_1 and server_2 connections all right.

> >> diff --git a/sql/handler.h b/sql/handler.h
> >> --- a/sql/handler.h
> >> +++ b/sql/handler.h
> >> @@ -873,6 +874,15 @@ typedef struct xid_t XID;
> >>  /* The 'buf' has to have space for at least SQL_XIDSIZE bytes. */
> >>  uint get_sql_xid(XID *xid, char *buf);
> >>  
> >> +/* struct for semisync slave binlog truncate recovery */
> >> +struct xid_recovery_member
> >> +{
> >> +  my_xid xid;
> >> +  uint in_engine_prepare;  // number of engines that have xid prepared
> >> +  bool decided_to_commit;
> >> +  std::pair binlog_coord; // semisync recovery binlog 
> >> offset
> >
> > wouldn't it be clearer to have a struct with named members?
> 
> I am not sure if there's a need for naming. But we do need
> comparison that std::pair provides for granted.
> 
> I could consider (a) new typedef(s) to represent the binlog id and
> offset in the file to make it more readable.
> 
> You say?
> 
> > in fact, I'm somewhat surprised there's no such struct for binlog coords
> > already.
> 
> I had started that back at HB implementation actually, but that struct
> did not attact much of attention.

I'd still suggest a structure, like

struct binlog_coords_t {
  uint binlog_num;
  my_off_t offset;
};

> >
> >> +};
> >> +
> >>  /* for recover() handlerton call */
> >>  #define MIN_XID_LIST_SIZE  128
> >>  #define MAX_XID_LIST_SIZE  (1024*128)
> >> @@ -4820,7 +4830,8 @@ int ha_commit_one_phase(THD *thd, bool all);
> >>  int ha_commit_trans(THD *thd, bool all);
> >>  int ha_rollback_trans(THD *thd, bool all);
> >>  int ha_prepare(THD *thd);
> >> -int ha_recover(HASH *commit_list);
> >> +int ha_recover(HASH *commit_list, MEM_ROOT *mem_root= NULL);
> >> +uint ha_recover_complete(HASH *commit_list, std::pair 
> >> *coord= NULL);
> >
> > is coord a truncation position?
> 
> So NULL is for the normal recovery.
> Non-NULL is an estimate of the 1st to rollback.
> Trx with lesser offsets