Hi,
sounds pretty much similar to what I was thinking about,
just that I had used different terms, and your description is much more
elegant than my handwritten notes.
Some comments below.
--------------------------------------------------------------------------------
I. Table semantics
An HBase consists of one or more HTables. An HTable is a list of
rows,
sorted alphabetically by "row name". An HTable also has a series of
"columns." A row may or may not contain a value for a column. The
HTable representation is sparse, so if a row does not contain a value
for a given column, there is no storage overhead.
(Thus, there's not really a "schema" to an HTable. Every
operation, even
adding a column, is considered a row-centric operation.)
The "current version" of a row is always available, timestamped
with its
last modification date. The system may also store previous
versions of a row,
according to how the HTable is configured.
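Just to check that I read the table semantics right, here is a tiny
in-memory model (all names are mine, nothing here is meant as the real
API): a row only stores the columns it actually has, every cell keeps
its versions keyed by timestamp, and a whole-row update runs under one
lock so it is atomic even when it touches several columns.

  import java.util.NavigableMap;
  import java.util.TreeMap;

  // Toy model of one HTable: rows sorted by row name, each row sparse,
  // each cell keeping older versions keyed by timestamp.
  public class ToyHTable {
    // row name -> (column name -> (timestamp -> value))
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
        new TreeMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>>();

    // Updates to a single row go through one lock, so an update touching
    // several columns is atomic with respect to other writers.
    public synchronized void put(String row, String column, byte[] value) {
      NavigableMap<String, NavigableMap<Long, byte[]>> r = rows.get(row);
      if (r == null) {
        r = new TreeMap<String, NavigableMap<Long, byte[]>>();
        rows.put(row, r);
      }
      NavigableMap<Long, byte[]> versions = r.get(column);
      if (versions == null) {
        versions = new TreeMap<Long, byte[]>();
        r.put(column, versions);
      }
      versions.put(System.currentTimeMillis(), value); // timestamped version
    }

    // Current version of a cell, or null if the row has no value for the
    // column; sparse, so a missing column costs nothing.
    public synchronized byte[] get(String row, String column) {
      NavigableMap<String, NavigableMap<Long, byte[]>> r = rows.get(row);
      if (r == null) return null;
      NavigableMap<Long, byte[]> versions = r.get(column);
      return versions == null ? null : versions.lastEntry().getValue();
    }
  }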
I was playing around with rows and ran into several problems using
the Hadoop io package (SequenceFile reader/writer).
Optimal would be if a cell were a Writable, but storing the row key,
cell key and value for every cell blows up disk usage.
Alternatively we can have a row Writable, so we store only one row
key, n column keys and n values.
In case a row has many columns this scales very badly. For example, my
row key is a URL, my column keys are user ids and the values are
numbers of clicks.
If I want to get the number of clicks for a given URL and user, I
need to load the values for all other users as well. :(
I already posted a mail about this issue.
What we may need is a Writer that can seek first by row key and
then by column key.
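Something like the following is what I had in mind for a row-level
record (just a sketch against the Hadoop Writable interface; the class
name and field layout are made up): one row key, then n column keys
and n values, so the row key is not repeated per cell. The drawback
from my example above remains, though. Reading one column still means
deserializing the whole row, which is why a reader that can seek by
row key and then by column key would help.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;

  // Sketch of a row-level record: one row key, n column keys, n values.
  public class RowWritable implements Writable {
    private Text rowKey = new Text();
    private Text[] columnKeys = new Text[0];
    private BytesWritable[] values = new BytesWritable[0];

    public RowWritable() {} // needed for readFields

    public RowWritable(Text rowKey, Text[] columnKeys, BytesWritable[] values) {
      this.rowKey = rowKey;
      this.columnKeys = columnKeys;
      this.values = values;
    }

    public void write(DataOutput out) throws IOException {
      rowKey.write(out);
      out.writeInt(columnKeys.length);
      for (int i = 0; i < columnKeys.length; i++) {
        columnKeys[i].write(out);
        values[i].write(out);
      }
    }

    public void readFields(DataInput in) throws IOException {
      rowKey.readFields(in);
      int n = in.readInt();
      columnKeys = new Text[n];
      values = new BytesWritable[n];
      for (int i = 0; i < n; i++) {
        columnKeys[i] = new Text();
        columnKeys[i].readFields(in);
        values[i] = new BytesWritable();
        values[i].readFields(in);
      }
    }
  }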
In general I agree with sparse structure.
Updates to a single row are always atomic and can affect one or
more columns.
II. System layout
HTables are partitionable into contiguous row regions called HRegions.
All machines in a pool run an HRegionServer. A given HRegion is
served
to clients by a single HRegionServer. A single HRegionServer may be
responsible for many HRegions. The HRegions for a single HTable will
be scattered across arbitrary HRegionServers.
When a client wants to add/delete/update a row value, it must
locate the
relevant HRegionServer. It then contacts the HRegionServer and
communicates
the updates. There may be other steps, mainly lock-oriented ones.
But locating
the relevant HRegionServers is a bare minimum.
My idea was to have the lock at the HRegionServer level, and that the
client itself takes care of replication, meaning it writes the value
to the n servers that hold replicas of the same HRegion.
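As a sketch of what I mean (the RegionConnection interface is made up
for illustration, not an existing RPC): the client sends the same
update to every server that holds a replica of the target HRegion and
treats the write as committed only once all of them acknowledged it.

  import java.io.IOException;
  import java.util.List;

  // Hypothetical client-side replication: write the same row update to
  // every HRegionServer that carries a replica of the target HRegion.
  public class ReplicatingWriter {

    // Made-up interface standing in for the real HRegionServer RPC.
    public interface RegionConnection {
      void put(String row, String column, byte[] value) throws IOException;
    }

    private final List<RegionConnection> replicas;

    public ReplicatingWriter(List<RegionConnection> replicas) {
      this.replicas = replicas;
    }

    // Returns only after all replicas accepted the update; if any of them
    // fails, the caller has to retry or report the write as uncommitted.
    public void put(String row, String column, byte[] value) throws IOException {
      for (RegionConnection replica : replicas) {
        replica.put(row, column, value);
      }
    }
  }

A replica that never answers is of course the hard part and is not
shown here.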
The HBase system can repartition an HTable at any time. For
example, many
repeated inserts at a single location may cause a single HRegion to
grow
very large. The HBase would then try to split that into multiple
HRegions.
Those HRegions may be served by the same HRegionServer as the
original or may be served by a different one.
Would the node send out a message to request a split, or does the
master decide based on heartbeat messages?
Each HRegionServer sends a regular heartbeat to an HBaseMaster
machine.
If the heartbeat for an HRegionServer fails, then the HBaseMaster
is responsible
for reassigning its HRegions to other available HRegionServers.
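The master side of that could be as simple as this (all names and the
lease timeout are mine): track the last heartbeat per HRegionServer
and hand the HRegions of any server that went silent for too long to
the reassignment logic.

  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.Map;

  // Toy lease tracking on the HBaseMaster: servers that miss heartbeats
  // for longer than LEASE_MILLIS are declared dead and their HRegions
  // are handed to the reassignment logic.
  public class HeartbeatMonitor {
    private static final long LEASE_MILLIS = 30 * 1000; // arbitrary choice

    private final Map<String, Long> lastHeartbeat = new HashMap<String, Long>();

    public synchronized void heartbeatReceived(String regionServer) {
      lastHeartbeat.put(regionServer, System.currentTimeMillis());
    }

    // Called periodically by the master.
    public synchronized void checkLeases(RegionAssigner assigner) {
      long now = System.currentTimeMillis();
      Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
      while (it.hasNext()) {
        Map.Entry<String, Long> entry = it.next();
        if (now - entry.getValue() > LEASE_MILLIS) {
          assigner.reassignRegionsOf(entry.getKey()); // move its HRegions elsewhere
          it.remove();
        }
      }
    }

    // Stand-in for the master's real reassignment machinery.
    public interface RegionAssigner {
      void reassignRegionsOf(String deadRegionServer);
    }
  }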
All HRegions are stored within DFS, so the HRegion is always
available, even
in the face of machine failures. The HRegionServers and DFS
DataNodes run
on the same set of machines. We would like for an HRegionServer to
always
serve data stored locally, but that is not guaranteed when using
DFS. We can
encourage it by:
1) In the event of an insert-motivated HRegion move, the new
HRegionServer
should always create a new DFS file for the new HRegion. The DFS
rules of
thumb will allocate the chunks locally for the HRegionServer.
2) In the event of a machine failure, we cannot do anything similar
to the above.
Instead, the HBaseMaster can ask DFS for hints as to where the
relevant file blocks are stored. If possible, it will allocate the
new HRegions to servers that physically contain the HRegion (see the
sketch after this list).
3) If necessary, we could add an API to DFS that demands block
replication
to a given node. I'd like to avoid this if possible.
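For (2), something along these lines might already be enough. This is
a sketch against the DFS client API; the exact call for block-location
hints has changed between releases, so treat the method below as
illustrative only.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Ask DFS where the blocks of an HRegion file live, so the master can
  // prefer those hosts when it reassigns the HRegion.
  public class LocalityHints {
    public static List<String> hostsHolding(Path hregionFile, Configuration conf)
        throws IOException {
      FileSystem fs = hregionFile.getFileSystem(conf);
      FileStatus status = fs.getFileStatus(hregionFile);
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      List<String> hosts = new ArrayList<String>();
      for (BlockLocation block : blocks) {
        for (String host : block.getHosts()) {
          if (!hosts.contains(host)) {
            hosts.add(host); // candidate HRegionServers for the reassignment
          }
        }
      }
      return hosts;
    }
  }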
My idea was to simply download the data to the node and do all reads
locally, but write into the DFS, since in my case write access can be
slower but I need very fast read access.
The mapping from row to HRegion (and hence, to HRegionServer) is
itself
stored in a special HTable. The HBaseMaster is the only client
allowed to
write to this HTable. This special HTable may itself be split into
several
HRegions. However, we only allow a hard-coded number of split-levels.
The top level of this hierarchy must be easily-stored on a single
machine.
That top-level table is always served by the HBaseMaster itself.
III. Client behavior
Let's think about what happens when a client wants to add a row.
1) The client must compute what HRegion is responsible for the key
it wants to insert into the HTable. It must navigate the row->HRegion
mapping, which is stored in an HTable.
So the client first contacts the HBaseMaster for the top-level
table contents.
It then steps downward through the table set, until it finds the
mapping for
the target row.
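Step 1 could look roughly like this on the client side (the
MetaTableClient interface and the layout of the mapping entries are
invented for the sketch): start from the top-level table held by the
HBaseMaster and follow the mapping downward, at most as many levels as
the hard-coded split depth, until a leaf entry names the HRegionServer
for the target row.

  // Hypothetical walk of the row -> HRegion mapping tables. Each level is
  // itself an HTable; a lookup returns either a pointer to the next-level
  // mapping region or, at the bottom, the HRegionServer for the row.
  public class RegionLocator {

    // Stand-in for reading one mapping HTable; not a real API.
    public interface MetaTableClient {
      // Returns the entry whose start row is the greatest one <= row.
      MappingEntry findEnclosing(String mappingRegion, String row);
    }

    public static class MappingEntry {
      public final boolean leaf;  // true: points at a data HRegion
      public final String target; // next mapping region, or HRegionServer address
      public MappingEntry(boolean leaf, String target) {
        this.leaf = leaf;
        this.target = target;
      }
    }

    private static final int MAX_LEVELS = 3; // hard-coded number of split levels

    public static String locateServer(MetaTableClient meta, String topLevelRegion, String row) {
      String current = topLevelRegion; // served by the HBaseMaster itself
      for (int level = 0; level < MAX_LEVELS; level++) {
        MappingEntry entry = meta.findEnclosing(current, row);
        if (entry.leaf) {
          return entry.target; // address of the HRegionServer for this row
        }
        current = entry.target; // descend one level
      }
      throw new IllegalStateException("mapping deeper than the hard-coded split levels");
    }
  }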
2) The client contacts the HRegionServer responsible for the target
row,
and asks to insert. If the HRegionServer is no longer responsible
for the
relevant HRegion, it returns a failure message and tells the client
to go
back to step 1 to find the new correct HRegionServer.
My idea was that in such a case the HRegionServer may already know
the new location, at least until the master is informed.
So getting a forward message could be faster than getting an error
and asking for the target again.
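A sketch of what the forward hint would buy us (exception and
interface types are made up): the client retries against the forwarded
address first and only falls back to the full lookup from step 1 when
the server has no hint.

  import java.io.IOException;

  // Hypothetical insert loop: on a miss the server either just rejects the
  // request (client redoes the lookup) or forwards the new location, which
  // saves one walk through the mapping tables.
  public class InsertWithRetry {

    public interface RegionServerConnection {
      // Throws WrongRegionException if this server no longer holds the row's HRegion.
      void insert(String row, String column, byte[] value) throws IOException;
    }

    public static class WrongRegionException extends IOException {
      public final String forwardedServer; // null if the server has no hint
      public WrongRegionException(String forwardedServer) {
        this.forwardedServer = forwardedServer;
      }
    }

    public interface Cluster {
      RegionServerConnection connect(String server) throws IOException;
      String lookupServer(String row) throws IOException; // the step-1 walk
    }

    public static void insert(Cluster cluster, String row, String column, byte[] value)
        throws IOException {
      String server = cluster.lookupServer(row);
      for (int attempt = 0; attempt < 3; attempt++) { // bounded retries
        try {
          cluster.connect(server).insert(row, column, value);
          return;
        } catch (WrongRegionException e) {
          // Prefer the forwarded address; otherwise redo the full lookup.
          server = (e.forwardedServer != null) ? e.forwardedServer : cluster.lookupServer(row);
        }
      }
      throw new IOException("could not locate the HRegion for row " + row);
    }
  }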
If the HRegionServer is the right place to go, it accepts the new
row from
the client. The HRegionServer guarantees that the insert is
atomic; it
will not intermingle the insert with a competing insert for the
same row key.
When the row is stored, the HRegionServer includes version and
timestamp
information.
3) That's it!
IV The HRegionServer
Maintaining the data for a single HRegion is slightly complicated.
It's
especially weird given the write-once semantics of DFS. There are
three important moving parts:
1) HBackedStore is a file-backed store for rows and their values.
It is never edited in place. It has B-Tree-like lookups for finding
a row quickly. HBackedStore is actually a series of on-disk stores,
each store being tuned for a certain object size. Thus, all the
"small"
(in bytes) values for a row live within the same file, all the medium
ones live in a separate file, etc. There is only one HBackedStore
for any single HRegion.
2) HUpdateLog is a log of updates to the HBackedStore. It is backed
by an on-disk file. When making reads from the HBackedStore, it may
be necessary to consult the HUpdateLog to see if any more-recent
updates have been made. There may be a series of HUpdateLogs
for a single HRegion.
3) HUpdateBuf is an in-memory version of HUpdateLog. It, too, needs
to be consulted whenever performing a read. There is only one
HUpdateBuf for a single HRegion.
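About the size-tuned files in (1): if I follow correctly, picking the
store file could be as simple as this (the byte thresholds are
placeholders of mine, not values from the proposal).

  // Pick which of the HBackedStore's per-size files a value belongs to.
  // The byte thresholds are placeholders, not values from the proposal.
  public class StoreTier {
    public static String tierFor(byte[] value) {
      if (value.length <= 512) {
        return "small";   // all "small" values share one file
      } else if (value.length <= 64 * 1024) {
        return "medium";  // medium-sized values live in a separate file
      } else {
        return "large";   // everything else
      }
    }
  }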
Any incoming edit is first made directly to the HUpdateBuf. Changes
made to the HUpdateBuf are volatile until flushed to an HUpdateLog.
The rate of flushes is an admin-configurable parameter.
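To make sure I read the read and write path right, here is a toy
version (the class names follow the proposal, everything else is
simplified; the stores are plain maps instead of DFS-backed files): an
edit lands in the in-memory buffer, a flush turns the buffer into a
new log, and a read checks the buffer, then the logs newest first,
then the backed store.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Toy HRegion internals: HBackedStore and HUpdateLog are reduced to maps
  // keyed by "row/column"; in the real thing they would be DFS-backed files.
  public class ToyHRegion {
    private final Map<String, byte[]> backedStore = new HashMap<String, byte[]>();            // HBackedStore
    private final List<Map<String, byte[]>> updateLogs = new ArrayList<Map<String, byte[]>>(); // HUpdateLogs, oldest first
    private Map<String, byte[]> updateBuf = new HashMap<String, byte[]>();                     // HUpdateBuf

    // Every edit goes to the in-memory buffer first; it is volatile until flushed.
    public synchronized void put(String row, String column, byte[] value) {
      updateBuf.put(row + "/" + column, value);
    }

    // Called at the admin-configured flush rate: the buffer becomes a new log.
    public synchronized void flush() {
      if (!updateBuf.isEmpty()) {
        updateLogs.add(updateBuf);
        updateBuf = new HashMap<String, byte[]>();
      }
    }

    // Reads consult the buffer, then the logs newest-first, then the store.
    public synchronized byte[] get(String row, String column) {
      String key = row + "/" + column;
      if (updateBuf.containsKey(key)) {
        return updateBuf.get(key);
      }
      for (int i = updateLogs.size() - 1; i >= 0; i--) {
        if (updateLogs.get(i).containsKey(key)) {
          return updateLogs.get(i).get(key);
        }
      }
      return backedStore.get(key);
    }
  }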
Periodically, the HBackedStore and the series of current HUpdateLogs
are merged to form a new HBackedStore. At that point, the old
HUpdateLog
objects can be destroyed. During this compaction process, edits are
made to the HUpdateBuf.
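And the compaction step as I understand it (same toy representation as
above): fold the logs into the store oldest first so newer edits win,
then drop the old logs; new edits keep going to the HUpdateBuf in the
meantime.

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Toy compaction: merge the HBackedStore with the accumulated HUpdateLogs
  // into a fresh store. Applying logs oldest-to-newest lets later edits
  // overwrite earlier ones; afterwards the old logs can be thrown away.
  public class ToyCompaction {
    public static Map<String, byte[]> compact(Map<String, byte[]> backedStore,
                                              List<Map<String, byte[]>> updateLogs) {
      Map<String, byte[]> newStore = new HashMap<String, byte[]>(backedStore);
      for (Map<String, byte[]> log : updateLogs) { // oldest first
        newStore.putAll(log);                      // newer edits win
      }
      updateLogs.clear();                          // old HUpdateLogs can be destroyed
      return newStore;
    }
  }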
Sounds great!
Looking forward to seeing this working.
Stefan