That is some of the finest art I've seen in a long time. We're located
close to MoMA. I'm going to see if we can get you an installation.
Answers inline.
Krzysztof Szlapinski wrote:
hi all,
to better understand how hbase works i started reading this document
http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
and created some diagrams
here they are (png, and svg for editing):
1) hbase hierarchy of objects:
http://www.starline.com.pl/hbase/habase_hierarchy.png
http://www.starline.com.pl/hbase/habase_hierarchy.svg
I'd suggest that Master, Client and RegionServer be peers rather than
arranged hierarchically. The client talks to the master only rarely, to
ask it where the catalog tables are located. Thereafter it talks
exclusively with the regionservers. Have arrows going from the client to
both the master and the regionserver.
2) hbase architecture (relations between objects)
http://www.starline.com.pl/hbase/habase_architecture.png
http://www.starline.com.pl/hbase/habase_architecture.svg
Same as comment above.
3) visual representation flush cache operation
http://www.starline.com.pl/hbase/hbase_flush_cache.png
http://www.starline.com.pl/hbase/hbase_flush_cache.svg
Here, flushes are done from the memcache. The diagram doesn't give this
impression.
since the documentation says that its information may be out of date
please feel free to comment on these diagrams, update them, put them
on your sites etc
i have a question too
let's say we have a cluster of 3 machines:
- 1 master + region server, and
- 2 region servers
on each machine I have a web server that uses an hbase client to get
information out of hbase
it is not clear to me where these clients should connect
should all clients connect directly and only to the master, which will
tell them which region server holds the information they are looking for?
or can they connect to the region servers, and if the information they
are looking for is not on them, the region servers will contact the
master and fetch the information for the client?
You almost have it.
A client that wants to insert row X into table A needs to figure out
which region of table A row X belongs to. This information is kept in
the .META. table. It is a listing of all regions for all tables, keyed
by table and the first row in each region, sorted lexicographically.
The locations of the regions that make up the .META. table are
themselves kept in a special catalog table, the -ROOT- table.
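To make the keying concrete, here is a toy sketch in Python. It is not
HBase code: the dict standing in for .META., the server names, and the
find_region helper are all made up for illustration. It only shows why
sorting on (table, first row) is enough to find the region for a row:
take the last entry whose first row is less than or equal to the row.

```python
import bisect

# Toy stand-in for .META.: (table, first row of region) -> hosting server.
# Every entry here is invented for illustration.
META = {
    ("A", ""): "rs1",
    ("A", "g"): "rs2",
    ("A", "p"): "rs3",
    ("B", ""): "rs1",
}


def find_region(meta, table, row):
    """Return (first_row, server) for the region of `table` holding `row`."""
    keys = sorted(meta)  # lexicographic order on (table, first_row)
    # Index of the last key that is <= (table, row).
    i = bisect.bisect_right(keys, (table, row)) - 1
    if i < 0 or keys[i][0] != table:
        raise KeyError("no region found for table %r" % (table,))
    return keys[i][1], meta[keys[i]]
```

With the sample data, a row "hello" in table A lands in the region that
starts at "g", because "g" is the largest first row that does not sort
after "hello".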
A fresh client -- one that has just started and so has an empty cache --
goes first to the master to ask it where the root region is hosted.
Once it has the address of the regionserver hosting the root region, it
caches it, and then it goes to the hosting regionserver to read the
location of the .META. table region that has the row that contains the
region of table A into which X should be inserted. After caching that
location, the client goes to the server hosting the .META. region and
reads the location of the region of table A where it should insert X.
Finally it goes to the server hosting table A's region and inserts X.
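The hops a fresh client makes can be sketched roughly as below. Again
this is a toy model in Python, not the HBase client: the MASTER and
SERVERS dicts, the server names, and insert_row are assumptions made up
for the example, and it pretends each table has a single region so the
lookups stay trivial.

```python
# Toy model of the lookup chain; all names and addresses are invented.
MASTER = {"root": "rs1"}  # the master knows where the root region lives

SERVERS = {
    "rs1": {"root": {"A": "rs2"}},  # hosts -ROOT-: points at a .META. region
    "rs2": {"meta": {"A": "rs3"}},  # hosts .META.: points at table A's region
    "rs3": {"rows": {}},            # hosts the region of table A
}


def insert_row(master, servers, table, row, value):
    """Walk master -> -ROOT- -> .META. -> table region, then write."""
    trail = ["master"]
    root_host = master["root"]                     # 1. ask the master
    trail.append(root_host)
    meta_host = servers[root_host]["root"][table]  # 2. read -ROOT-
    trail.append(meta_host)
    data_host = servers[meta_host]["meta"][table]  # 3. read .META.
    trail.append(data_host)
    servers[data_host]["rows"][(table, row)] = value  # 4. insert the row
    return trail
```

Calling insert_row(MASTER, SERVERS, "A", "X", "v") returns the list of
places visited in order, which makes it easy to see that the master is
only touched once, at the very start.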
Over time, the client builds up a cache of where regions are located and
will rely on this information rather than travel the net to read
locations every time it needs to find a region -- until there is a
fault. At that time, it will walk back up the hierarchy of region
locations to fix its list of locations and then away it goes again.
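Here is a minimal sketch of that cache-until-fault behaviour, in Python.
ToyCluster and ToyClient are hypothetical classes written for this
example, not anything from HBase; locate() stands in for the whole walk
through the master and catalog tables, and the retry-once policy is my
own simplification.

```python
class ToyCluster:
    """Made-up cluster that counts lookups which 'travel the net'."""

    def __init__(self):
        self.location = {"A,region1": "rs1"}  # region name -> hosting server
        self.lookups = 0

    def locate(self, region):
        # Stands in for the full master -> -ROOT- -> .META. walk.
        self.lookups += 1
        return self.location[region]

    def put(self, server, region, row):
        # A server that no longer hosts the region rejects the request.
        if self.location[region] != server:
            raise LookupError("region not served here")
        return "stored %s on %s" % (row, server)


class ToyClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.cache = {}  # region name -> last known hosting server

    def put(self, region, row):
        if region not in self.cache:
            self.cache[region] = self.cluster.locate(region)
        try:
            return self.cluster.put(self.cache[region], region, row)
        except LookupError:
            # Fault: the cached location is stale. Refresh it and retry once.
            self.cache[region] = self.cluster.locate(region)
            return self.cluster.put(self.cache[region], region, row)
```

Two puts in a row cost one lookup between them. If the region then moves
to another server, the next put fails against the cached address, the
client looks the region up again, and the retry succeeds.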
Check out the Bigtable paper. It does a better job than I do of
explaining how this all works.
St.Ack
krzysiek