On Mon, Aug 3, 2009 at 11:14 AM, Sanjit Jhala<[email protected]> wrote:
> Hi Luke,
>
> How does this pseudocode interact with regular master operations (alter/drop, etc.)? Are those operations suspended whenever there is a recovery operation, or would you go deeper to see if those operations are actually affected by the recovery and proceed otherwise?
As I mentioned before, alter table and drop table can be implemented in a nonblocking fashion (regardless of recovery) by having 1. the master ignore the down range servers, and 2. the destination range server refresh its schema from hyperspace before replaying the ranges. We can do this because hyperspace is assumed to be always available and has the final word. The only operations that would need to be retried are range move and split, which need to be retried by the originating range server in case of errors (destination range server down or metadata not ready).

> Also conceptually, a DAG is a simplified finite state machine (no loops), so I suspect the main difference in Doug's proposal is that the dependencies are formalized and abstracted into the ODG instead of being implicit via state transitions (i.e. a collection of if-then-elses).
>
> -Sanjit

The key difference is that an FSM can be statically validated/proved and compiled, while the ODG approach is dynamic and the dependencies need to be set up at runtime, which is error prone. I can definitely see where he is coming from, but I'm not convinced yet that going the ODG route would actually simplify things.

__Luke

> On Mon, Aug 3, 2009 at 10:42 AM, Luke <[email protected]> wrote:
>>
>> I still don't see why the ODG design simplifies things. The argument for ODG would be more convincing if you could walk me through a simple example that demonstrates the necessity and/or simplicity. For example, the recovery handling method can be straightforwardly implemented with this rough pseudocode:
>>
>> recover(range_server) {
>>   if (down_range_servers.size() == 0)
>>     start_recovery(range_server);
>>   else
>>     restart_recovery(range_server);
>> }
>>
>> start_recovery(range_server) {
>>   down_range_servers.add(range_server);
>>   rid = get_recovery_id();
>>   mml->log_start_recovery(rid, down_range_servers);
>>   do_recovery(rid);
>> }
>>
>> restart_recovery(range_server) {
>>   down_range_servers.add(range_server);
>>   old_rid = current_recovery_id;
>>   rid = get_recovery_id();
>>   mml->log_restart_recovery(old_rid, rid, down_range_servers);
>>   do_recovery(rid);
>> }
>>
>> do_recovery(rid) {
>>   [root_range, meta_ranges, user_ranges] = get_down_ranges(rid);
>>
>>   if (root_range)
>>     recover_range(rid, root_range);
>>
>>   parallel foreach(range : meta_ranges)
>>     recover_range(rid, range);
>>
>>   parallel foreach(range : user_ranges)
>>     recover_range(rid, range);
>> }
>>
>> recover_range(rid, range) {
>>   check_recovery_state(rid);
>>   dest = range_scheduler->get_range_server(rid, range);
>>   replay_load_range(dest, range);
>>   replay_commit_log(dest, range);
>>   replay_commit(dest);
>>   mml->log_done_recover_range(rid, range);
>> }
>>
>> check_recovery_state(rid) {
>>   if (rid != current_recovery_id)
>>     throw RecoveryCancelled("recovery restarted");
>> }
>>
>> I just don't see a DAG approach getting any simpler: you'd have to create a separate class and set up dependencies explicitly for each task; you'd need to traverse the entire graph for every event/interrupt; and you'd probably need to cancel a subtree explicitly for any event/interrupt. In the above simple approach, the dependencies are implicitly expressed in the running stack; an exception can abort the entire stack as needed when a restart is detected.
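To make that cancellation point concrete, here is a minimal C++ sketch of how the exception-based restart could be wired up. The names (RecoveryCoordinator, start_or_restart, check) are made up for illustration and are not the actual Master code; the point is only that a bumped recovery id plus one exception type is enough to abort an in-flight recovery stack.

    #include <atomic>
    #include <cstdint>
    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Thrown when a newer recovery supersedes the one currently executing.
    struct RecoveryCancelled : public std::runtime_error {
      explicit RecoveryCancelled(const std::string &msg) : std::runtime_error(msg) {}
    };

    class RecoveryCoordinator {
    public:
      // Called from the event thread whenever a range server goes down;
      // bumping the id invalidates any recovery already in flight.
      uint64_t start_or_restart() { return ++current_rid_; }

      // Called at the top of each per-range step; unwinds the running stack
      // if the recovery it belongs to has been superseded.
      void check(uint64_t rid) const {
        if (rid != current_rid_.load())
          throw RecoveryCancelled("recovery restarted");
      }

      // Stand-in for replay_load_range / replay_commit_log / replay_commit.
      void recover_range(uint64_t rid, const std::string &range) {
        check(rid);
        std::cout << "recovering " << range << " under rid " << rid << "\n";
      }

      void do_recovery(uint64_t rid, const std::vector<std::string> &ranges) {
        try {
          for (const auto &r : ranges)      // could be a parallel foreach
            recover_range(rid, r);
        } catch (const RecoveryCancelled &) {
          // Nothing to unwind by hand: finished ranges are already logged in
          // the MML, and the restarted recovery simply takes over from there.
          std::cout << "rid " << rid << " cancelled, restart takes over\n";
        }
      }

    private:
      std::atomic<uint64_t> current_rid_{0};
    };

    int main() {
      RecoveryCoordinator rc;
      uint64_t rid1 = rc.start_or_restart();
      rc.do_recovery(rid1, {"root", "meta[0..a]", "user[a..m]"});

      // Simulate a second failure arriving before the stale recovery runs.
      uint64_t rid2 = rc.start_or_restart();
      rc.do_recovery(rid1, {"user[m..z]"});               // stale id: aborts
      rc.do_recovery(rid2, {"user[a..m]", "user[m..z]"}); // restarted recovery
      return 0;
    }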
>>
>> For tasks like move and split, if the destination range server is not available or can't finish because related metadata is not available, simply cancel the task and the originating range server can retry later (the RSML will ensure durability of the transaction.)
>>
>> On Sun, Aug 2, 2009 at 12:36 AM, Doug Judd<[email protected]> wrote:
>> > The way I arrived at this design was via an attempt to simplify things. The Master is essentially an interrupt-driven system. It sits idle until it gets asynchronously interrupted by a user command, RangeServer failure, or split report. Each one of these interrupts creates some operation that needs to be performed. Operations that are in progress or future operations may need to be halted and made dependent on this new operation (e.g. RangeServer recovery). This is what prompted the dependency graph idea, which leads to the following server design:
>> >
>> > while (true) {
>> >   wait_for_interrupt()
>> >   pause_execution_engine()
>> >   modify_dependency_graph()
>> >   restart_execution_engine()
>> > }
>> >
>> > It doesn't get more clean and elegant than that. All prior design attempts led to a big pile of if statements and while loops. I can't think of a simpler design than this.
>>
>> Even the simplest task would require a separate class and explicit dependency setup in the ODG design. If a task is not immutable like typical map-reduce jobs, i.e., if it depends on master state (e.g., recovery or any nonblocking version of commands), the application logic becomes much more obfuscated.
>>
>> IMHO, a potentially better abstraction is a finite state machine. You can specify/declare a state machine in a simple language, e.g.:
>>
>> event: range_server_down(range_server):
>>   state DEFAULT: { start_recovery(); } -> RECOVERING1
>>   state RECOVERING1: { restart_recovery(); }
>>   ...
>>
>> and a compiler can validate and generate an FSM that you can instantiate and basically do:
>>
>> while ((event = wait_for_event())) {
>>   fsm->handle(event);
>> }
>>
>> But I think that the hand-coded simple approach is easy to implement and understand, and probably more appropriate for the first release.
>>
>> __Luke
>>
>> > - Doug
>> >
>> > On Sat, Aug 1, 2009 at 2:07 PM, Luke <[email protected]> wrote:
>> >>
>> >> A DAG is only required when you want to do transactions with dynamic dependencies (only known at runtime) in parallel. Let's examine the examples:
>> >>
>> >> On Sat, Aug 1, 2009 at 10:29 AM, Doug Judd<[email protected]> wrote:
>> >> > I agree that for the current version of the ALTER TABLE command, you could probably implement it in such a way that you wouldn't need this mechanism. However, there are several other situations that come to mind where this isn't the case:
>> >> >
>> >> > 1. We've talked about adding a 'force' option which would cause a major compaction to occur on each range after its schema was altered, so that the space taken up by dropped columns could get immediately freed up.
>> >> > 2. Tyler has been asking for the ability to delete certain portions of a table (e.g. drop all rows less than a certain value or all cells before some timestamp). To do this we're going to add a COMPACT command (see issue 307).
>> >> > 3. At some point we'll be adding a SNAPSHOT command.
>> >> > Since CellStores can be shared, cheap table snapshots can be made by first performing a compaction and then copying the METADATA entries for the table being snapshotted.
>> >> >
>> >> > In all three of these situations, if a RangeServer goes down in the middle of the operation, when one of the participating ranges in the operation is recovered on a new RangeServer, it won't have enough state to finish the operation.
>> >>
>> >> So far all these commands can be implemented using a single transaction. Let's use compact as an example:
>> >>
>> >> 1. the master writes begin compact(list of ranges) -> id in the MML
>> >> 2. for each finished range the master gets an ack and writes done_compact(range) in the MML
>> >> 3. if any of the range servers dies, the master can restart the transaction by writing restart compact(remaining ranges) -> id in the MML
>> >> 4. the above steps (2-3) can happen multiple times, when multiple range servers die.
>> >>
>> >> > The place where maintaining a complex dependency graph is probably the most important is in the RangeServer recovery process itself. As Sanjit pointed out, the recovery process needs to happen in roughly three stages: ROOT range recovery, METADATA range recovery, and lastly USER range recovery. If you drill down into the RangeServer recovery process, it can be broken down into the following four dependent steps:
>> >> >
>> >> > 1. Replay the RSML and assign all the ranges to new RangeServers
>> >> > 2. Invoke RangeServer::replay_load_range() for each range assigned to a RangeServer
>> >> > 3. Replay commit log fragments
>> >> > 4. Invoke RangeServer::replay_commit() on each participating RangeServer
>> >> >
>> >> > At any point in this process a RangeServer could go down, and if it is one of the RangeServers participating in the recovery, then things get complicated. As a general rule, you can't carry out step #4 for USER ranges unless the METADATA ranges are available, and you can't carry out step #4 for METADATA ranges until the ROOT range is available.
>> >>
>> >> Since we only handle one recovery transaction at a time, with the 4 required subtransaction steps known up front:
>> >>
>> >> 1. when a range server dies, the master writes begin recover(dead_range_server1) -> id in the MML
>> >> 2. for each range in dead_range_server1's metalog, the master invokes replay_load_range and writes done_replay_load_range(range) in the MML
>> >> 3. if another range server dies, the master writes restart recovery(dead_range_server1, dead_range_server2) -> id in the MML. The master can then compute the union of the ranges that need to be recovered by examining the done_replay_load_range entries so far and the ranges in both dead range servers' metalogs, and basically does step 2 again.
>> >> 4. if replay_load_range finishes, the master can proceed to the next steps. Any range server death during the following steps would cause a restart (step 3.)
>> >> 5. the above steps can happen multiple times, hopefully with fewer ranges to recover each time (otherwise, it's a cascading failure that would cause the whole cluster to die;
>> >> we will probably need to send out alerts when the rate of death accelerates).
>> >>
>> >> > Ideally the recovery of each USER range should only be dependent on the recovery of the METADATA range that holds its corresponding entry. Getting the recovery process implemented optimally will be important for maintaining good 99th percentile latency in large (thousands of nodes) deployments running on a cloud provider like EC2.
>> >>
>> >> This is the only place a DAG is useful (but not required) so far. And you don't need to persist it in the MML; you can reconstruct the DAG by scanning the MML if the master dies, because you know the dependencies beforehand (meta before user.)
>> >>
>> >> > Given all these dependencies, a directed acyclic graph seems like a clean and natural representation.
>> >>
>> >> A DAG is a natural representation for parallel transactions with dependencies, but it really doesn't buy you much at this stage, mostly because the dependencies are not created dynamically by the user at runtime (like in general map/reduce jobs.) The bulk of the work is to get the recovery transaction right. I'd be very happy if we can do the parallel replay_load_range and replay_commit_log like the Baidu guys did. Fine-grained parallel range recovery (multiple (meta, user) transactions) is nice to have and a little premature for the first release, IMHO.
>> >>
>> >> __Luke
>> >>
>> >> > - Doug
>> >> >
>> >> > On Fri, Jul 31, 2009 at 3:49 PM, Luke <[email protected]> wrote:
>> >> >>
>> >> >> On Fri, Jul 31, 2009 at 3:30 PM, Doug Judd<[email protected]> wrote:
>> >> >> > The two examples that I gave are just two of many possible scenarios:
>> >> >> >
>> >> >> > - What if you are in the middle of DROP TABLE and a range server goes down and then in the middle of that recovery another range server goes down, etc.
>> >> >>
>> >> >> It would still work if the schema is updated in hyperspace and the remaining range servers, as long as the destination range servers are doing the right thing: ignoring ranges that belong to a nonexistent table. The range server can cache negative results of schema queries for a while to avoid hitting hyperspace too much on unseen tables (the range server should already do that to avoid an accidental DDoS on hyperspace.)
>> >> >>
>> >> >> > - What if someone does an ALTER TABLE and then before it is complete, one of the participating range servers goes down and then before the range server recovers, one of the range servers participating in the recovery goes down and then someone does a DROP table ...
>> >> >>
>> >> >> It really doesn't matter; hyperspace keeps the schema and the remaining range servers get the memo: alter table can finish on the remaining range servers. Same thing for the drop table. The key is to populate the correct schema lazily when the ranges are recovered.
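As an aside, the negative-result caching mentioned above could be as simple as the following hypothetical C++ sketch. SchemaCache and fetch_from_hyperspace are invented names, not the real range server code; the idea is just to remember "no such table" answers for a short TTL so that replayed ranges of dropped tables don't hammer hyperspace.

    #include <chrono>
    #include <optional>
    #include <string>
    #include <unordered_map>

    // Hypothetical schema lookup cache with negative-result caching; none of
    // these names come from the actual Hypertable range server code.
    class SchemaCache {
    public:
      using Clock = std::chrono::steady_clock;

      explicit SchemaCache(Clock::duration negative_ttl)
          : negative_ttl_(negative_ttl) {}

      // Returns the schema for the table, or an empty optional if the table
      // is known (recently) not to exist, without asking hyperspace again.
      std::optional<std::string> get(const std::string &table) {
        auto it = negative_.find(table);
        if (it != negative_.end()) {
          if (Clock::now() - it->second < negative_ttl_)
            return std::nullopt;           // cached "no such table"
          negative_.erase(it);             // negative entry expired, re-check
        }
        std::optional<std::string> schema = fetch_from_hyperspace(table);
        if (!schema)
          negative_[table] = Clock::now(); // remember the miss for a while
        return schema;
      }

    private:
      // Stand-in for the real hyperspace round trip; always misses here.
      std::optional<std::string> fetch_from_hyperspace(const std::string &) {
        return std::nullopt;
      }

      Clock::duration negative_ttl_;
      std::unordered_map<std::string, Clock::time_point> negative_;
    };

A range server replaying a recovered range would call get() first and silently drop the range when the lookup comes back empty.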
>> >> >>
>> >> >> > - What if the Master is in the middle of a MOVE operation and then the range server that the range is being moved to goes down before it reports back, then recovery starts and then someone does an ALTER TABLE?
>> >> >>
>> >> >> > - What if there are 16 overlapping range server failures?
>> >> >>
>> >> >> As long as the operations are idempotent, all you need to do is update the schema on hyperspace and the remaining range servers atomically.
>> >> >>
>> >> >> > Trying to program for all of the scenarios would be a total nightmare without some sort of dependency abstraction. Just because these scenarios are rare doesn't mean we can ignore them. We have to come up with a clean design that covers all the possible scenarios.
>> >> >>
>> >> >> The cleanest design doesn't have to handle these dependencies if possible. The ODG and related stuff would be a debugging nightmare.
>> >> >>
>> >> >> > This isn't over-engineering. It is clean engineering. There is no way to cleanly design the Master without some sort of dependency abstraction.
>> >> >>
>> >> >> I still haven't seen a compelling example yet. I'll change my mind when I see it.
>> >> >>
>> >> >> __Luke
>> >> >>
>> >> >> > - Doug
>> >> >> >
>> >> >> > On Fri, Jul 31, 2009 at 2:54 PM, Luke <[email protected]> wrote:
>> >> >> >>
>> >> >> >> The Master is getting more and more like a workqueue and jobtracker :) It seems advantageous to actually create a separate general server to manage all the tasks, which could be used to schedule map/reduce tasks in the future as well.
>> >> >> >>
>> >> >> >> On Fri, Jul 31, 2009 at 11:14 AM, Doug Judd<[email protected]> wrote:
>> >> >> >> > The Master is responsible for orchestrating recovery from RangeServer failures as well as carrying out meta operations in response to commands such as CREATE TABLE, ALTER TABLE, and DROP TABLE. These meta operations are relatively straightforward except in the face of RangeServer failure. When this happens, any in-progress meta operation that is dependent on the failed RangeServer needs to block until the RangeServer has been recovered. If another RangeServer that is involved in the recovery goes down, there is now another recovery operation that needs to get carried out. The Master can quickly start building up a fairly complex set of operation dependencies.
>> >> >> >> >
>> >> >> >> > The master is also responsible for moving ranges from one RangeServer to another when load across the RangeServers gets out of balance.
>> >> >> >> > If a MOVE RANGE operation is in progress when, say, an ALTER TABLE request arrives, and the range being moved is part of the table specified in the ALTER TABLE request, then the ALTER TABLE operation needs to wait until the MOVE RANGE operation is complete before it can continue. Also, if two ALTER TABLE requests arrive at the Master at the same time, then they should get carried out in sequential order, with one of the ALTER TABLE operations depending on the completion of the other operation.
>> >> >> >>
>> >> >> >> I'm not sure about this particular case. For alter table while ranges are split/moved, it seems to me that it works as long as you update the schema in hyperspace/range servers atomically. The split/moved ranges on the new destination server will get the right schema. Also, two alter tables can overlap in many cases, as long as the schema updates on hyperspace/range servers are atomic. For cases where alter tables on the same table need to be sequenced, it's actually not too much to ask the application to do the sequencing, as alter table is not really a frequent operation (otherwise, they should go with a generic column family and go nuts on qualifiers.)
>> >> >> >>
>> >> >> >> > To handle these dependencies, I propose designing the Master as an execution engine for a directed acyclic graph of operations, or operation dependency graph (ODG). Each node in the graph would represent an operation (e.g. ALTER TABLE, RECOVER RangeServer) and would contain dynamic state. Execution threads would carry out the operations by picking up nodes from the graph in topological sort order. When a RangeServer dies, the ODG execution engine would pause, a new "RECOVER RangeServer" node would get created, and the ODG would get modified to include this new node. All of the existing nodes that were dependent on that RangeServer would become dependent on this new RECOVER RangeServer node. At this point the ODG execution engine would be restarted.
>> >> >> >>
>> >> >> >> The same alter table arguments apply here as well. You can let the alter table proceed on hyperspace and the remaining range servers. The recovered ranges would get the right schema. Otherwise, an alter table command could take a long time (up to a few minutes) while one of the range servers is being recovered.
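For what it's worth, the nonblocking alter table flow being argued for here could be sketched roughly as follows in C++. Everything in it (RangeServerProxy, HyperspaceStub, alter_table) is a made-up stand-in, not the real Master API; the only point is the ordering: hyperspace gets the schema first, live range servers are refreshed, and down servers are skipped because their ranges re-read the schema from hyperspace during replay.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Illustrative stand-in for a connection to one range server.
    struct RangeServerProxy {
      std::string name;
      bool up;
      void update_schema(const std::string &table) {
        std::cout << name << ": refreshed schema for " << table << "\n";
      }
    };

    // Illustrative stand-in for hyperspace, which holds the authoritative schema.
    class HyperspaceStub {
    public:
      void set_schema(const std::string &table, const std::string &schema) {
        schemas_[table] = schema;  // durable copy with the final word
      }
    private:
      std::map<std::string, std::string> schemas_;
    };

    // Nonblocking alter table: write the schema to hyperspace first, then tell
    // the live range servers; down servers are simply ignored, since any range
    // recovered onto a new server re-reads the schema from hyperspace before
    // replaying its commit log.
    void alter_table(HyperspaceStub &hyperspace,
                     std::vector<RangeServerProxy> &servers,
                     const std::string &table, const std::string &new_schema) {
      hyperspace.set_schema(table, new_schema);
      for (auto &rs : servers) {
        if (!rs.up)          // no blocking on an in-progress recovery
          continue;
        rs.update_schema(table);
      }
    }

    int main() {
      HyperspaceStub hs;
      std::vector<RangeServerProxy> servers = {
          {"rs1", true}, {"rs2", false}, {"rs3", true}};
      alter_table(hs, servers, "LogTable", "<Schema>...</Schema>");
      return 0;
    }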
>> >> >> >>
>> >> >> >> > The Master Meta Log (MML) would essentially persist any changes to the ODG, both node state as well as structural graph changes. When the Master fails and a new one comes up, it would replay the MML to reconstruct the ODG, after which it could continue execution.
>> >> >> >> >
>> >> >> >> > Thoughts?
>> >> >> >>
>> >> >> >> It seems to me that an ODG is not absolutely required for normal Hypertable operations. I'd like to avoid over-engineering (if possible) for the first release.
>> >> >> >>
>> >> >> >> __Luke
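Whichever representation wins, both proposals rely on replaying the MML at master failover to rebuild in-flight state. Here is a minimal sketch of what that replay could look like, using hypothetical entry names (start/restart recovery and done_recover_range, echoing the pseudocode earlier in the thread) rather than the real log format.

    #include <cstdint>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical MML entry; the real log format is not specified in this thread.
    struct MmlEntry {
      enum Type { START_RECOVERY, RESTART_RECOVERY, DONE_RECOVER_RANGE } type;
      uint64_t rid;
      std::string range;  // only meaningful for DONE_RECOVER_RANGE
    };

    // State a new master needs in order to resume an interrupted recovery:
    // the current recovery id and the ranges already recovered.
    struct RecoveryState {
      uint64_t current_rid = 0;
      std::set<std::string> done_ranges;
    };

    RecoveryState replay_mml(const std::vector<MmlEntry> &log) {
      RecoveryState state;
      for (const auto &e : log) {
        switch (e.type) {
        case MmlEntry::START_RECOVERY:
        case MmlEntry::RESTART_RECOVERY:
          // A restart keeps earlier progress; the caller adds the ranges of
          // the newly failed server back into the to-do set.
          state.current_rid = e.rid;
          break;
        case MmlEntry::DONE_RECOVER_RANGE:
          state.done_ranges.insert(e.range);
          break;
        }
      }
      return state;
    }

    int main() {
      std::vector<MmlEntry> log = {
          {MmlEntry::START_RECOVERY, 1, ""},
          {MmlEntry::DONE_RECOVER_RANGE, 1, "meta[0..a]"},
          {MmlEntry::RESTART_RECOVERY, 2, ""},
          {MmlEntry::DONE_RECOVER_RANGE, 2, "root"},
      };
      RecoveryState s = replay_mml(log);
      std::cout << "resume recovery " << s.current_rid << " with "
                << s.done_ranges.size() << " range(s) already done\n";
      return 0;
    }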
