I agree that for the current version of the ALTER TABLE command, you could
probably implement it in such a way that you wouldn't need this mechanism.
However, there are several other situations that come to mind where this
isn't the case:

1. We've talked about adding a 'force' option which would cause a major
compaction to occur on each range after its schema was altered so that the
space taken up by dropped columns could be freed up immediately.
2. Tyler has been asking for the ability to delete certain portions of a
table (e.g. drop all rows less than a certain value or all cells before some
timestamp).  To do this we're going to add a COMPACT command (see issue 307:
http://code.google.com/p/hypertable/issues/detail?id=307).
3. At some point we'll be adding a SNAPSHOT command.  Since CellStores can
be shared, cheap table snapshots can be made by first performing a
compaction and then copying the METADATA entries for the table being
snapshotted.

In all three of these situations, if a RangeServer goes down in the middle
of the operation, then when one of the participating ranges is recovered on
a new RangeServer, it won't have enough state to finish the operation.
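
To make that concrete, here is a rough sketch (hypothetical C++ only;
RangeSpec and TableWideOperation are made-up names, not anything in the
current code base) of the kind of per-range bookkeeping the Master would
have to carry, and persist, for a table-wide operation like COMPACT or
SNAPSHOT:

#include <set>
#include <string>
#include <utility>

// Hypothetical sketch only; none of these types exist in Hypertable.
// A table-wide operation (e.g. COMPACT or SNAPSHOT) is tracked as the set
// of ranges that still owe work.
struct RangeSpec {
  std::string table_id;
  std::string end_row;
  bool operator<(const RangeSpec &o) const {
    return table_id < o.table_id ||
           (table_id == o.table_id && end_row < o.end_row);
  }
};

class TableWideOperation {
public:
  explicit TableWideOperation(std::set<RangeSpec> ranges)
    : pending_(std::move(ranges)) { }

  // A RangeServer reports that the work for one range is done.
  void range_complete(const RangeSpec &range) { pending_.erase(range); }

  // A participating range was recovered on a new RangeServer; the new
  // server has no memory of the operation, so the range goes back into
  // the pending set until the work is redone there.
  void range_recovered(const RangeSpec &range) { pending_.insert(range); }

  bool finished() const { return pending_.empty(); }

private:
  std::set<RangeSpec> pending_;
};

The point is just that this state has to live in the Master and survive
failures, because the recovered range itself can't reconstruct it.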

The place where maintaining a complex dependency graph is probably the most
important is in the RangeServer recovery process itself.  As Sanjit pointed
out, the recovery process needs to happen in roughly three stages: ROOT
range recovery, then METADATA range recovery, and lastly USER range
recovery.  If you drill down into the RangeServer recovery process, it can
be broken down into the following four dependent steps:

1. Replay RSML and assign all the ranges to new RangeServers
2. Invoke RangeServer::replay_load_range() for each range assigned to a
RangeServer
3. Replay commit log fragments
4. Invoke RangeServer::replay_commit() on each participating RangeServer

At any point in this process a RangeServer could go down, and if it is one
of the RangeServers participating in the recovery, things get complicated.
As a general rule, you can't carry out step #4 for USER ranges unless the
METADATA ranges are available, and you can't carry out step #4 for METADATA
ranges until the ROOT range is available.  Ideally the recovery of each USER
range should depend only on the recovery of the METADATA range that holds
its corresponding entry.  Getting the recovery process implemented optimally
will be important for maintaining good 99th percentile latency in large
(thousands of nodes) deployments running on a cloud provider like EC2.
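
As a rough illustration (hypothetical types and names only, not the actual
Master code), the dependency structure described above could be wired up
along these lines, with each USER range recovery waiting only on the
recovery of the METADATA range that holds its entry:

#include <map>
#include <vector>

// Hypothetical sketch only.  'Operation' stands in for whatever node type
// the Master's dependency graph would use; an edge means "this operation
// cannot run until its prerequisite has completed".
struct Operation {
  std::vector<Operation *> prerequisites;
};

void add_dependency(Operation *dependent, Operation *prerequisite) {
  dependent->prerequisites.push_back(prerequisite);
}

// metadata_for maps each USER range recovery to the recovery of the
// METADATA range that holds its entry (how that mapping gets computed is
// outside the scope of this sketch).
void add_recovery_dependencies(
    Operation *root_recovery,
    const std::vector<Operation *> &metadata_recoveries,
    const std::map<Operation *, Operation *> &metadata_for) {
  // A METADATA range can't finish recovery until the ROOT range is available.
  for (Operation *md : metadata_recoveries)
    add_dependency(md, root_recovery);

  // A USER range waits only on the METADATA range holding its entry, not
  // on every METADATA range.
  for (const auto &entry : metadata_for)
    add_dependency(entry.first, entry.second);
}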

Given all these dependencies, a directed acyclic graph seems like a clean
and natural representation.
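
For what it's worth, the core of such an execution engine is small.  Here
is a minimal single-threaded sketch (hypothetical code; MML persistence,
error handling, and the pause/modify/restart path on RangeServer failure
are all left out) of carrying out operation nodes in topological order:

#include <functional>
#include <queue>
#include <vector>

// Hypothetical sketch only.  An operation becomes runnable once every
// operation it depends on has completed.
struct OperationNode {
  std::function<void()> execute;            // e.g. ALTER TABLE, RECOVER RangeServer
  std::vector<OperationNode *> dependents;  // operations waiting on this one
  int unfinished_dependencies = 0;
};

void run_odg(const std::vector<OperationNode *> &nodes) {
  std::queue<OperationNode *> ready;
  for (OperationNode *node : nodes)
    if (node->unfinished_dependencies == 0)
      ready.push(node);

  while (!ready.empty()) {
    OperationNode *node = ready.front();
    ready.pop();
    node->execute();
    // Wake up any operation whose last prerequisite just finished.
    for (OperationNode *dep : node->dependents)
      if (--dep->unfinished_dependencies == 0)
        ready.push(dep);
  }
}

In the real thing, multiple execution threads would pull from the same
ready queue, and adding a RECOVER RangeServer node just means bumping the
dependency counts of the operations that now have to wait on it.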

- Doug

On Fri, Jul 31, 2009 at 3:49 PM, Luke <[email protected]> wrote:

>
> On Fri, Jul 31, 2009 at 3:30 PM, Doug Judd<[email protected]> wrote:
> > The two examples that I gave are just two of many possible scenarios:
> >
> > - What if you are in the middle of DROP TABLE and a range server goes
> > down and then in the middle of that recovery another range server goes
> > down, etc.
>
> It would still work if the schema is updated in hyperspace and on the
> remaining range servers, as long as the destination range servers are
> doing the right thing: ignoring ranges that belong to a nonexistent
> table. The range server can cache negative results of schema queries
> for a while to avoid hitting hyperspace too much on unseen tables
> (range servers should already do that to avoid an accidental DDoS on
> hyperspace.)
>
> > - What if someone does an ALTER TABLE and then before it is complete, one
> > of the participating range servers goes down and then before the range
> > server recovers, one of the range servers participating in the recovery
> > goes down and then someone does a DROP TABLE ...
>
> It really doesn't matter; hyperspace keeps the schema and the
> remaining range servers get the memo: the alter table can finish on the
> remaining range servers. Same thing for the drop table. The key is to
> populate the correct schema lazily when the ranges are recovered.
>
> > - What if the Master is in the middle of a MOVE operation and then the
> > range server that the range is being moved to goes down before it reports
> > back, then recovery starts and then someone does an ALTER TABLE?
>
> > - What if there are 16 overlapping range server failures?
>
> As long as the operations are idempotent, all you need to do is
> update the schema on hyperspace and the remaining range servers
> atomically.
>
> > Trying to program for all of the scenarios would be a total nightmare
> > without some sort of dependency abstraction.  Just because these scenarios
> > are rare doesn't mean we can ignore them.  We have to come up with a clean
> > design that covers all the possible scenarios.
>
> The cleanest design is one that doesn't have to handle these
> dependencies at all, if possible. The ODG and related stuff would be a
> debugging nightmare.
>
> > This isn't over-engineering.  It is clean engineering.  There is no way
> to
> > cleanly design the Master without some sort of dependency abstraction.
>
> I still haven't seen a compelling example. I'll change my mind when I
> see one.
>
> __Luke
>
> > - Doug
> >
> > On Fri, Jul 31, 2009 at 2:54 PM, Luke <[email protected]> wrote:
> >>
> >> Master is getting more and more like a workqueue and jobtracker :) It
> >> seems advantageous to actually create a separate general server to
> >> manage all the tasks, which could be used to schedule map/reduce tasks
> >> in the future as well.
> >>
> >> On Fri, Jul 31, 2009 at 11:14 AM, Doug Judd<[email protected]> wrote:
> >> > The Master is responsible for orchestrating recovery from RangeServer
> >> > failures as well as carrying out meta operations in response to commands
> >> > such as CREATE TABLE, ALTER TABLE, and DROP TABLE.  These meta operations
> >> > are relatively straightforward except in the face of RangeServer failure.
> >> > When this happens, any in-progress meta operation that is dependent on the
> >> > failed RangeServer needs to block until the RangeServer has been recovered.
> >> > If another RangeServer that is involved in the recovery goes down, there is
> >> > now another recovery operation that needs to get carried out.  The Master
> >> > can quickly start building up a fairly complex set of operation
> >> > dependencies.
> >> >
> >> > The Master is also responsible for moving ranges from one RangeServer
> >> > to another when load across the RangeServers gets out of balance.  If a
> >> > MOVE RANGE operation is in progress when, say, an ALTER TABLE request
> >> > arrives, and the range being moved is part of the table specified in the
> >> > ALTER TABLE request, then the ALTER TABLE operation needs to wait until
> >> > the MOVE RANGE operation is complete before it can continue.  Also, if
> >> > two ALTER TABLE requests arrive at the Master at the same time, then they
> >> > should get carried out in sequential order with one of the ALTER TABLE
> >> > operations depending on the completion of the other operation.
> >>
> >> I'm not sure about this particular case. For alter table while ranges
> >> are split/moved, it seems to me that it works as long as you update
> >> the schema in hyperspace/range servers atomically. The split/moved
> >> ranges on the new destination server will get the right schema. Also,
> >> two alter tables can overlap in many cases, as long as the schema
> >> updates on hyperspace/range servers are atomic. For cases where alter
> >> table on the same table needs to be sequenced, it's actually not too
> >> much to ask the application to do the sequencing, as alter table is
> >> not really a frequent operation (otherwise, they should go with a
> >> generic column family and go nuts on qualifiers.)
> >>
> >> > To handle these dependencies, I propose designing the Master as an
> >> > execution engine for a directed acyclic graph of operations, or operation
> >> > dependency graph (ODG).  Each node in the graph would represent an
> >> > operation (e.g. ALTER TABLE, RECOVER RangeServer) and would contain
> >> > dynamic state.  Execution threads would carry out the operations by
> >> > picking up nodes from the graph in topological sort order.  When a
> >> > RangeServer dies, the ODG execution engine would pause, a new "RECOVER
> >> > RangeServer" node would get created, and the ODG would get modified to
> >> > include this new node.  All of the existing nodes that were dependent on
> >> > that RangeServer would become dependent on this new RECOVER RangeServer
> >> > node.  At this point the ODG execution engine would be restarted.
> >>
> >> The same alter table arguments can apply here as well. You can let the
> >> alter table proceed on hyperspace and the remaining range servers.
> >> The recovered ranges would get the right schema. Otherwise, an alter
> >> table command can take a long time (up to a few minutes) while one of
> >> the range servers is being recovered.
> >>
> >> > The Master Meta Log (MML) would essentially persist any changes to the
> >> > ODG, both node state as well as structural graph changes.  When the
> >> > Master fails and a new one comes up, it would replay the MML to
> >> > reconstruct the ODG, after which it could continue execution.
> >> >
> >> > Thoughts?
> >>
> >> It seems to me that an ODG is not absolutely required for normal
> >> Hypertable operations. I'd like to avoid over-engineering (if
> >> possible) for the first release.
> >>
> >> __Luke
> >>
