Hi Luke,

How does this pseudocode interact with regular master operations (alter/drop,
etc.)? Are those operations suspended whenever there is a recovery operation,
or would you go deeper to check whether they are actually affected by the
recovery and let them proceed otherwise?

Also, conceptually a DAG is a simplified finite state machine (no loops), so
I suspect the main difference in Doug's proposal is that the dependencies
are formalized and abstracted into the ODG instead of being implicit in
state transitions (i.e., a collection of if-then-elses).
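
To make that concrete, here is a rough sketch of the two shapes side by side.
All of the types and names below are invented for illustration; nothing here
is taken from Doug's proposal:

// ODG style: dependencies are explicit edges on operation nodes.
#include <string>
#include <vector>

struct Operation {
  std::string name;                    // e.g. "ALTER TABLE foo"
  std::vector<Operation*> depends_on;  // formalized dependencies
  bool done = false;
  bool runnable() const {              // ready once every parent is done
    for (auto* dep : depends_on)
      if (!dep->done) return false;
    return true;
  }
};

// FSM style: the same dependencies live implicitly in state transitions.
enum class MasterState { DEFAULT, RECOVERING };

void handle_range_server_down(MasterState& state) {
  switch (state) {
    case MasterState::DEFAULT:
      /* start_recovery(); */
      state = MasterState::RECOVERING;
      break;
    case MasterState::RECOVERING:
      /* restart_recovery(); */
      break;
  }
}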

-Sanjit

On Mon, Aug 3, 2009 at 10:42 AM, Luke <[email protected]> wrote:

>
> I still don't see why the ODG design simplifies things. The argument
> for the ODG would be more convincing if you could walk me through a simple
> example that demonstrates its necessity and/or simplicity. For
> example: the recovery handling could be implemented straightforwardly
> with this rough pseudocode:
>
> recover(range_server) {
>  if (down_range_servers.size() == 0)
>    start_recovery(range_server);
>  else
>    restart_recovery(range_server);
> }
>
> start_recovery(range_server) {
>  down_range_servers.add(range_server);
>  rid = get_recovery_id();
>  current_recovery_id = rid;
>  mml->log_start_recovery(rid, down_range_servers);
>  do_recovery(rid);
> }
>
> restart_recovery(range_server) {
>  down_range_servers.add(range_server);
>  old_rid = current_recovery_id;
>  rid = get_recovery_id();
>  current_recovery_id = rid;
>  mml->log_restart_recovery(old_rid, rid, down_range_servers);
>  do_recovery(rid);
> }
>
> do_recovery(rid) {
>  [root_range, meta_ranges, user_ranges] = get_down_ranges(rid);
>
>  if (root_range)
>    recover_range(rid, root_range);
>
>  parallel foreach(range : meta_ranges)
>    recover_range(rid, range);
>
>  parallel foreach(range : user_ranges)
>    recover_range(rid, range);
> }
>
> recover_range(rid, range) {
>  check_recovery_state(rid);
>  dest = range_scheduler->get_range_server(rid, range);
>  replay_load_range(dest, range);
>  replay_commit_log(dest, range);
>  replay_commit(dest);
>  mml->log_done_recover_range(rid, range);
> }
>
> check_recovery_state(rid) {
>  if (rid != current_recovery_id)
>    throw RecoveryCancelled("recovery restarted");
> }
>
> I just don't see a DAG approach getting any simpler: you'd have to
> create a separate class and set up dependencies explicitly for each
> task; you'd need to traverse the entire graph for every
> event/interrupt; and you'd probably need to cancel a subtree explicitly
> for any event/interrupt. In the simple approach above, the
> dependencies are implicitly expressed in the running call stack, and an
> exception can abort the entire stack as needed when a restart is
> detected.
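>
> To be concrete, the "parallel foreach" above could be as dumb as a batch
> of futures that rethrow on cancellation. Rough sketch only -- none of the
> names below are real Hypertable code:
>
> // Sketch of "parallel foreach" + stack-unwinding cancellation.
> #include <future>
> #include <stdexcept>
> #include <vector>
>
> struct RecoveryCancelled : std::runtime_error {
>   using std::runtime_error::runtime_error;
> };
>
> template <typename Range, typename Fn>
> void parallel_foreach(const std::vector<Range>& ranges, Fn fn) {
>   std::vector<std::future<void>> tasks;
>   for (const auto& range : ranges)
>     tasks.push_back(std::async(std::launch::async, fn, range));
>   // get() rethrows RecoveryCancelled from any worker, unwinding the
>   // whole do_recovery() call stack -- the implicit dependency above.
>   for (auto& task : tasks)
>     task.get();
> }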
>
> For tasks like move and split, if the destination range server is not
> available, or can't finish because the related metadata is unavailable,
> simply cancel the task and let the originating range server retry
> later (the RSML will ensure durability of the transaction).
>
> On Sun, Aug 2, 2009 at 12:36 AM, Doug Judd<[email protected]> wrote:
> > The way I arrived at this design was via an attempt to simplify things.
> > The Master is essentially an interrupt driven system.  It sits idle until
> > it gets asynchronously interrupted by a user command, RangeServer failure,
> > or split report.  Each one of these interrupts creates some operation that
> > needs to be performed.  Operations that are in-progress or future
> > operations may need to be halted and made dependent on this new operation
> > (e.g. RangeServer recovery).  This is what prompted the dependency graph
> > idea which leads to the following server design:
> >
> > while (true) {
> >   wait_for_interrupt()
> >   pause_execution_engine()
> >   modify_dependency_graph()
> >   restart_execution_engine()
> > }
> >
> > It doesn't get any cleaner or more elegant than that.  All prior design
> > attempts led to a big pile of if statements and while loops.  I can't
> > think of a simpler design than this.
>
> Even the simplest task would require a separate class and explicit
> dependency setup in the ODG design. And if a task is not immutable like
> typical map-reduce jobs, i.e., if it depends on master state (e.g.,
> recovery, or any nonblocking version of a command), the application
> logic gets much more obfuscated.
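>
> For example, even a hypothetical compact task would need something like
> the following in an ODG world (rough sketch; the class and method names
> are mine, not from Doug's proposal):
>
> #include <memory>
> #include <vector>
>
> class OdgNode {
>  public:
>   virtual ~OdgNode() {}
>   virtual void execute() = 0;  // run once all dependencies are done
>   void add_dependency(std::shared_ptr<OdgNode> dep) { deps_.push_back(dep); }
>  protected:
>   std::vector<std::shared_ptr<OdgNode>> deps_;
> };
>
> class CompactRangeOp : public OdgNode {
>  public:
>   explicit CompactRangeOp(int range_id) : range_id_(range_id) {}
>   void execute() override { /* issue compact RPC, log progress in MML */ }
>  private:
>   int range_id_;
> };
>
> // And every interrupt has to splice nodes like this into the graph:
> //   auto compact = std::make_shared<CompactRangeOp>(42);
> //   compact->add_dependency(recover_op);  // explicit dependency setup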
>
> IMHO, a potentially better abstraction is a finite state machine. You
> can specify/declare a state machine in a simple language, e.g.:
>
> event: range_server_down(range_server):
>  state DEFAULT: { start_recovery(); } -> RECOVERING1
>  state RECOVERING1: { restart_recovery(); }
>  ...
>
> and a compiler can validate and generate an FSM that you can
> instantiate and basically do:
>
> while ((event = wait_for_event())) {
>  fsm->handle(event);
> }
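>
> The generated handler would boil down to something like this (rough
> sketch; the enum and class names are just placeholders):
>
> enum class State { DEFAULT, RECOVERING1 };
> enum class EventType { RANGE_SERVER_DOWN /* , ... */ };
>
> struct Event {
>   EventType type;
>   // payload, e.g. the range server that went down
> };
>
> class MasterFsm {
>  public:
>   void handle(const Event& ev) {
>     switch (ev.type) {
>       case EventType::RANGE_SERVER_DOWN:
>         if (state_ == State::DEFAULT) {
>           // start_recovery(...);
>           state_ = State::RECOVERING1;
>         } else {
>           // restart_recovery(...);
>         }
>         break;
>     }
>   }
>  private:
>   State state_ = State::DEFAULT;
> };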
>
> But I think the hand-coded simple approach above (the recover()
> pseudocode) is easy to implement and understand, and probably more
> appropriate for the first release.
>
> __Luke
>
> >
> > - Doug
> >
> > On Sat, Aug 1, 2009 at 2:07 PM, Luke <[email protected]> wrote:
> >>
> >> A DAG is only required when you want to run transactions with dynamic
> >> dependencies (only known at runtime) in parallel. Let's examine the
> >> examples:
> >>
> >> > On Sat, Aug 1, 2009 at 10:29 AM, Doug Judd <[email protected]> wrote:
> >> > I agree that for the current version of the ALTER TABLE command, you
> >> > could probably implement it in such a way that you wouldn't need this
> >> > mechanism.  However, there are several other situations that come to
> >> > mind where this isn't the case:
> >> >
> >> > 1. We've talked about adding a 'force' option which would cause a major
> >> > compaction to occur on each range after its schema was altered, so that
> >> > the space taken up by dropped columns could get immediately freed up.
> >> > 2. Tyler has been asking for the ability to delete certain portions of
> >> > a table (e.g. drop all rows less than a certain value or all cells
> >> > before some timestamp).  To do this we're going to add a COMPACT
> >> > command (see issue 307).
> >> > 3. At some point we'll be adding a SNAPSHOT command.  Since CellStores
> >> > can be shared, cheap table snapshots can be made by first performing a
> >> > compaction and then copying the METADATA entries for the table being
> >> > snapshotted.
> >> >
> >> > In all three of these situations, if a RangeServer goes down in the
> >> > middle of the operation, when one of the participating ranges in the
> >> > operation is recovered on a new RangeServer, it won't have enough state
> >> > to finish the operation.
> >>
> >> So far all these commands can be implemented using a single
> >> transaction. Let's use compact as an example:
> >>
> >> 1. the master writes begin compact(list of ranges) -> id in the MML
> >> 2. for each finished range the master gets an ack and writes
> >> done_compact(range) in the MML
> >> 3. if any of the range servers dies, the master can restart the
> >> transaction: restart compact(remaining ranges) -> id in the MML
> >> 4. the above steps (2-3) can happen multiple times, when multiple
> >> range servers die.
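> >>
> >> In code the whole transaction is tiny. Rough sketch only -- the RangeId,
> >> Mml and method names below are all invented:
> >>
> >> #include <set>
> >> #include <string>
> >>
> >> using RangeId = std::string;
> >>
> >> struct Mml {  // stand-in for the master meta log
> >>   void log_begin_compact(const std::set<RangeId>&) {}
> >>   void log_done_compact(const RangeId&) {}
> >>   void log_restart_compact(const std::set<RangeId>&) {}
> >> };
> >>
> >> struct CompactTxn {
> >>   Mml* mml = nullptr;
> >>   std::set<RangeId> remaining;  // ranges still to compact
> >>
> >>   void begin(const std::set<RangeId>& ranges) {  // step 1
> >>     remaining = ranges;
> >>     mml->log_begin_compact(ranges);
> >>     // ... issue compact RPCs to the owning range servers ...
> >>   }
> >>   void on_ack(const RangeId& range) {  // step 2
> >>     mml->log_done_compact(range);
> >>     remaining.erase(range);
> >>   }
> >>   void on_range_server_death() {  // steps 3-4, possibly many times
> >>     mml->log_restart_compact(remaining);
> >>     // ... reissue compact RPCs for the remaining ranges ...
> >>   }
> >> };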
> >>
> >> > The place where maintaining a complex dependency graph is probably the
> >> > most important is in the RangeServer recovery process itself.  As Sanjit
> >> > pointed out, the recovery process needs to happen in roughly three
> >> > stages: ROOT range recovery, METADATA range recovery, and lastly USER
> >> > range recovery.  If you drill down into the RangeServer recovery
> >> > process, it can be broken down into the following four dependent steps:
> >> >
> >> > 1. Replay RSML and assign all the ranges to new RangeServers
> >> > 2. Invoke RangeServer::replay_load_range() for each range assigned to a
> >> > RangeServer
> >> > 3. Replay commit log fragments
> >> > 4. Invoke the RangeServer::replay_commit() on each participating
> >> > RangeServer
> >> >
> >> > At any point in this process a RangeServer could go down, and if it is
> >> > one of the RangeServers participating in the recovery, then things get
> >> > complicated.  As a general rule, you can't carry out step #4 for USER
> >> > ranges unless the METADATA ranges are available, and you can't carry out
> >> > step #4 for METADATA ranges until the ROOT range is available.
> >>
> >> Since we only handle one recovery transaction at a time, and the 4
> >> required sub-transactions (the steps above) are known in advance:
> >>
> >> 1. when a range server dies, the master writes begin
> >> recover(dead_range_server1) -> id in the MML
> >> 2. for each range in dead_range_server1's metalog, the master invokes
> >> replay_load_range and writes done_replay_load_range(range) in the MML
> >> 3. if another range server dies, the master writes restart
> >> recovery(dead_range_server1, dead_range_server2) -> id in the MML. The
> >> master can then compute the union of the ranges that need to be
> >> recovered by examining the done_replay_load_range entries so far and the
> >> ranges in both dead range servers' metalogs, and basically does step 2
> >> again.
> >> 4. if replay_load_range finishes, the master can proceed to the next
> >> steps. Any range server death during the following steps would cause a
> >> restart (step 3).
> >> 5. the above steps can happen multiple times, hopefully with fewer
> >> ranges to recover each time (otherwise it's a cascading failure that
> >> would take down the whole cluster; we'll probably need to send out
> >> alerts when the rate of deaths accelerates)
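> >>
> >> The restart computation in step 3 is just set arithmetic over what is
> >> already in the logs. Rough sketch (names invented; RangeId is whatever
> >> uniquely identifies a range):
> >>
> >> #include <map>
> >> #include <set>
> >> #include <string>
> >>
> >> using RangeId = std::string;
> >> using ServerName = std::string;
> >>
> >> // metalog_ranges: ranges recorded in each dead server's RSML
> >> // replayed:       ranges with a done_replay_load_range entry in the MML
> >> std::set<RangeId> ranges_to_recover(
> >>     const std::map<ServerName, std::set<RangeId>>& metalog_ranges,
> >>     const std::set<RangeId>& replayed) {
> >>   std::set<RangeId> todo;
> >>   for (const auto& entry : metalog_ranges)  // union over all dead servers
> >>     todo.insert(entry.second.begin(), entry.second.end());
> >>   for (const auto& range : replayed)        // minus work already done
> >>     todo.erase(range);
> >>   return todo;
> >> }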
> >>
> >> > Ideally the recovery of each USER range should only be dependent on the
> >> > recovery of the METADATA range that holds its corresponding entry.
> >> > Getting the recovery process implemented optimally will be important
> >> > for maintaining good 99th percentile latency in large (thousands of
> >> > nodes) deployments running on a cloud provider like EC2.
> >>
> >> This is the only place a DAG is useful (but not required) so far. And
> >> you don't need to persist it in the MML; you can reconstruct the DAG by
> >> scanning the MML if the master dies, because you know the dependencies
> >> beforehand (META before USER).
> >>
> >> > Given all these dependencies, a directed acyclic graph seems like a
> >> > clean and natural representation.
> >>
> >> A DAG is a natural representation for parallel transactions with
> >> dependencies, but it really doesn't buy you much at this stage, mostly
> >> because the dependencies are not created dynamically by the user at
> >> runtime (like in general map/reduce jobs). The bulk of the work is to
> >> get the recovery transaction right. I'd be very happy if we can do the
> >> parallel replay_load_range and replay_commit_log like the Baidu guys
> >> did. Fine-grained parallel range recovery (multiple (META, USER)
> >> transactions) is nice to have and a little premature for the first
> >> release, IMHO.
> >>
> >> __Luke
> >>
> >> >
> >> > - Doug
> >> >
> >> > On Fri, Jul 31, 2009 at 3:49 PM, Luke <[email protected]> wrote:
> >> >>
> >> >> On Fri, Jul 31, 2009 at 3:30 PM, Doug Judd<[email protected]>
> >> >> wrote:
> >> >> > The two examples that I gave are just two of many possible scenarios:
> >> >> >
> >> >> > - What if you are in the middle of DROP TABLE and a range server goes
> >> >> > down and then in the middle of that recovery another range server
> >> >> > goes down, etc.
> >> >>
> >> >> It would still work if the schema is updated in hyperspace and on the
> >> >> remaining range servers, as long as the destination range servers are
> >> >> doing the right thing: ignoring ranges that belong to a nonexistent
> >> >> table. The range server can cache negative results of schema queries
> >> >> for a while to avoid hitting hyperspace too much for unseen tables
> >> >> (the range server should do that already, to avoid an accidental DDoS
> >> >> on hyperspace).
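> >> >>
> >> >> Something as simple as a TTL'd negative cache would do here. Rough
> >> >> sketch; the class name and TTL handling are made up:
> >> >>
> >> >> #include <chrono>
> >> >> #include <map>
> >> >> #include <string>
> >> >>
> >> >> class NegativeSchemaCache {
> >> >>  public:
> >> >>   explicit NegativeSchemaCache(std::chrono::seconds ttl) : ttl_(ttl) {}
> >> >>
> >> >>   // true if we recently learned that this table does not exist
> >> >>   bool known_missing(const std::string& table) const {
> >> >>     auto it = missing_.find(table);
> >> >>     return it != missing_.end() &&
> >> >>            std::chrono::steady_clock::now() - it->second < ttl_;
> >> >>   }
> >> >>
> >> >>   // record a "no such table" answer from hyperspace
> >> >>   void remember_missing(const std::string& table) {
> >> >>     missing_[table] = std::chrono::steady_clock::now();
> >> >>   }
> >> >>
> >> >>  private:
> >> >>   std::chrono::seconds ttl_;
> >> >>   std::map<std::string, std::chrono::steady_clock::time_point> missing_;
> >> >> };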
> >> >>
> >> >> > - What if someone does an ALTER TABLE and then before it is complete,
> >> >> > one of the participating range servers goes down and then before the
> >> >> > range server recovers, one of the range servers participating in the
> >> >> > recovery goes down and then someone does a DROP TABLE ...
> >> >>
> >> >> It really doesn't matter: hyperspace keeps the schema and the
> >> >> remaining range servers get the memo, so the alter table can finish on
> >> >> the remaining range servers. Same thing for the drop table. The key is
> >> >> to populate the correct schema lazily when the ranges are recovered.
> >> >>
> >> >> > - What if the Master is in the middle of a MOVE operation and then
> >> >> > the range server that the range is being moved to goes down before it
> >> >> > reports back, then recovery starts and then someone does an ALTER
> >> >> > TABLE?
> >> >>
> >> >> > - What if there are 16 overlapping range server failures?
> >> >>
> >> >> As long as the operations are idempotent, all you need to do is
> >> >> update the schema on hyperspace and the remaining range servers
> >> >> atomically.
> >> >>
> >> >> > Trying to program for all of the scenarios would be a total
> >> >> > nightmare without some sort of dependency abstraction.  Just because
> >> >> > these scenarios are rare doesn't mean we can ignore them.  We have to
> >> >> > come up with a clean design that covers all the possible scenarios.
> >> >>
> >> >> The cleanest design avoids having to handle these dependencies in the
> >> >> first place, if possible. The ODG and related machinery would be a
> >> >> debugging nightmare.
> >> >>
> >> >> > This isn't over-engineering.  It is clean engineering.  There is no
> >> >> > way to cleanly design the Master without some sort of dependency
> >> >> > abstraction.
> >> >>
> >> >> I still haven't seen a compelling example yet. I'll change my mind
> >> >> when I see it.
> >> >>
> >> >> __Luke
> >> >>
> >> >> > - Doug
> >> >> >
> >> >> > On Fri, Jul 31, 2009 at 2:54 PM, Luke <[email protected]> wrote:
> >> >> >>
> >> >> >> The Master is getting more and more like a workqueue and jobtracker
> >> >> >> :) It seems advantageous to actually create a separate general
> >> >> >> server to manage all the tasks, which could be used to schedule
> >> >> >> map/reduce tasks in the future as well.
> >> >> >>
> >> >> >> On Fri, Jul 31, 2009 at 11:14 AM, Doug Judd <[email protected]> wrote:
> >> >> >> > The Master is responsible for orchestrating recovery from
> >> >> >> > RangeServer failures as well as carrying out meta operations in
> >> >> >> > response to commands such as CREATE TABLE, ALTER TABLE, and DROP
> >> >> >> > TABLE.  These meta operations are relatively straightforward
> >> >> >> > except in the face of RangeServer failure.  When this happens, any
> >> >> >> > in-progress meta operation that is dependent on the failed
> >> >> >> > RangeServer needs to block until the RangeServer has been
> >> >> >> > recovered.  If another RangeServer that is involved in the
> >> >> >> > recovery goes down, there is now another recovery operation that
> >> >> >> > needs to get carried out.  The Master can quickly start building
> >> >> >> > up a fairly complex set of operation dependencies.
> >> >> >> >
> >> >> >> > The master is also responsible for moving ranges from one
> >> >> >> > RangeServer to another when load across the RangeServers gets out
> >> >> >> > of balance.  If a MOVE RANGE operation is in progress when, say,
> >> >> >> > an ALTER TABLE request arrives, and the range being moved is part
> >> >> >> > of the table specified in the ALTER TABLE request, then the ALTER
> >> >> >> > TABLE operation needs to wait until the MOVE RANGE operation is
> >> >> >> > complete before it can continue.  Also, if two ALTER TABLE
> >> >> >> > requests arrive at the Master at the same time, then they should
> >> >> >> > get carried out in sequential order with one of the ALTER TABLE
> >> >> >> > operations depending on the completion of the other operation.
> >> >> >>
> >> >> >> I'm not sure about this particular case. For an alter table while
> >> >> >> ranges are being split/moved, it seems to me that it's fine as long
> >> >> >> as you update the schema in hyperspace and on the range servers
> >> >> >> atomically. The split/moved ranges on the new destination server
> >> >> >> will get the right schema. Also, two alter tables can overlap in
> >> >> >> many cases, as long as the schema updates on hyperspace/range
> >> >> >> servers are atomic. For cases where alter tables on the same table
> >> >> >> need to be sequenced, it's actually not too much to ask the
> >> >> >> application to do the sequencing, as alter table is not really a
> >> >> >> frequent operation (otherwise, they should go with a generic column
> >> >> >> family and go nuts on qualifiers).
> >> >> >>
> >> >> >> > To handle these dependencies, I propose designing the Master as
> >> >> >> > an execution engine for a directed acyclic graph of operations, or
> >> >> >> > operation dependency graph (ODG).  Each node in the graph would
> >> >> >> > represent an operation (e.g. ALTER TABLE, RECOVER RangeServer) and
> >> >> >> > would contain dynamic state.  Execution threads would carry out
> >> >> >> > the operations by picking up nodes from the graph in topological
> >> >> >> > sort order.  When a RangeServer dies, the ODG execution engine
> >> >> >> > would pause, a new "RECOVER RangeServer" node would get created,
> >> >> >> > and the ODG would get modified to include this new node.  All of
> >> >> >> > the existing nodes that were dependent on that RangeServer would
> >> >> >> > become dependent on this new RECOVER RangeServer node.  At this
> >> >> >> > point the ODG execution engine would be restarted.
> >> >> >>
> >> >> >> The same alter table arguments apply here as well. You can let the
> >> >> >> alter table proceed on hyperspace and the remaining range servers.
> >> >> >> The recovered ranges would get the right schema. Otherwise, an alter
> >> >> >> table command could take a long time (up to a few minutes) while one
> >> >> >> of the range servers is being recovered.
> >> >> >>
> >> >> >> > The Master Meta Log (MML) would essentially persist any changes
> >> >> >> > to the ODG, both node state as well as structural graph changes.
> >> >> >> > When the Master fails and a new one comes up, it would replay the
> >> >> >> > MML to reconstruct the ODG after which it could continue
> >> >> >> > execution.
> >> >> >> >
> >> >> >> > Thoughts?
> >> >> >>
> >> >> >> It seems to me that an ODG is not absolutely required for normal
> >> >> >> Hypertable operations. I'd like to avoid over-engineering (if
> >> >> >> possible) for the first release.
> >> >> >>
> >> >> >> __Luke
> >> >> >>
