Repository: hbase
Updated Branches:
  refs/heads/master 2153a92fa -> 92c3b877c

HBASE-11476 Expand 'Conceptual View' section of Data Model chapter (Misty Stanley-Jones)

Project: http://git-wip-us.apache.org/repos/asf/hbase/repo
Commit: http://git-wip-us.apache.org/repos/asf/hbase/commit/92c3b877
Tree: http://git-wip-us.apache.org/repos/asf/hbase/tree/92c3b877
Diff: http://git-wip-us.apache.org/repos/asf/hbase/diff/92c3b877

Branch: refs/heads/master
Commit: 92c3b877c0a2f1ca0fa6c791e41fbcb889f220ad
Parents: 2153a92
Author: Jonathan M Hsieh <[email protected]>
Authored: Wed Aug 13 14:57:16 2014 -0700
Committer: Jonathan M Hsieh <[email protected]>
Committed: Wed Aug 13 14:57:16 2014 -0700

----------------------------------------------------------------------
 src/main/docbkx/book.xml | 342 +++++++++++++++++++++++++++++-------------
 1 file changed, 240 insertions(+), 102 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/hbase/blob/92c3b877/src/main/docbkx/book.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml
index d37537f..603839c 100644
--- a/src/main/docbkx/book.xml
+++ b/src/main/docbkx/book.xml
@@ -91,38 +91,129 @@
  <chapter xml:id="datamodel">
  <title>Data Model</title>
- <para>In short, applications store data into an HBase table. Tables are made of rows and
- columns. All columns in HBase belong to a particular column family. Table cells -- the
- intersection of row and column coordinates -- are versioned. A cell’s content is an
- uninterpreted array of bytes. </para>
- <para>Table row keys are also byte arrays so almost anything can serve as a row key from strings
- to binary representations of longs or even serialized data structures. Rows in HBase tables
- are sorted by row key. The sort is byte-ordered. All table accesses are via the table row key
- -- its primary key. </para>
+ <para>In HBase, data is stored in tables, which have rows and columns. This is a terminology
+ overlap with relational databases (RDBMSs), but this is not a helpful analogy. Instead, it can
+ be helpful to think of an HBase table as a multi-dimensional map.</para>
+ <variablelist>
+ <title>HBase Data Model Terminology</title>
+ <varlistentry>
+ <term>Table</term>
+ <listitem>
+ <para>An HBase table consists of multiple rows.</para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>Row</term>
+ <listitem>
+ <para>A row in HBase consists of a row key and one or more columns with values associated
+ with them. Rows are sorted alphabetically by the row key as they are stored. For this
+ reason, the design of the row key is very important. The goal is to store data in such a
+ way that related rows are near each other. A common row key pattern is a website domain.
+ If your row keys are domains, you should probably store them in reverse (org.apache.www,
+ org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each
+ other in the table, rather than being spread out based on the first letter of the
+ subdomain.</para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>Column</term>
+ <listitem>
+ <para>A column in HBase consists of a column family and a column qualifier, which are
+ delimited by a <literal>:</literal> (colon) character.</para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>Column Family</term>
+ <listitem>
+ <para>Column families physically colocate a set of columns and their values, often for
+ performance reasons. Each column family has a set of storage properties, such as whether
+ its values should be cached in memory, how its data is compressed or its row keys are
+ encoded, and others. Each row in a table has the same column
+ families, though a given row might not store anything in a given column family.</para>
+ <para>Column families are specified when you create your table, and influence the way your
+ data is stored in the underlying filesystem. Therefore, the column families should be
+ considered carefully during schema design.</para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>Column Qualifier</term>
+ <listitem>
+ <para>A column qualifier is added to a column family to provide the index for a given
+ piece of data. Given a column family <literal>content</literal>, a column qualifier
+ might be <literal>content:html</literal>, and another might be
+ <literal>content:pdf</literal>. Though column families are fixed at table creation,
+ column qualifiers are mutable and may differ greatly between rows.</para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>Cell</term>
+ <listitem>
+ <para>A cell is a combination of row, column family, and column qualifier, and contains a
+ value and a timestamp, which represents the value's version.</para>
+ <para>A cell's value is an uninterpreted array of bytes.</para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>Timestamp</term>
+ <listitem>
+ <para>A timestamp is written alongside each value, and is the identifier for a given
+ version of a value. By default, the timestamp represents the time on the RegionServer
+ when the data was written, but you can specify a different timestamp value when you put
+ data into the cell.</para>
+ <caution>
+ <para>Direct manipulation of timestamps is an advanced feature which is only exposed for
+ special cases that are deeply integrated with HBase, and is discouraged in general.
+ Encoding a timestamp at the application level is the preferred pattern.</para>
+ </caution>
+ <para>You can specify the maximum number of versions of a value that HBase retains, per column
+ family. When the maximum number of versions is reached, the oldest versions are
+ eventually deleted. By default, only the newest version is kept.</para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
  <section xml:id="conceptual.view">
  <title>Conceptual View</title>
+ <para>You can read a very understandable explanation of the HBase data model in the blog post <link
+ xlink:href="http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable">Understanding
+ HBase and BigTable</link> by Jim R. Wilson. Another good explanation is available in the
+ PDF <link
+ xlink:href="http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf">Introduction
+ to Basic Schema Design</link> by Amandeep Khurana. It may help to read different
+ perspectives to get a solid understanding of HBase schema design. The linked articles cover
+ the same ground as the information in this section.</para>
  <para> The following example is a slightly modified form of the one on page 2 of the <link xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> paper. There
- is a table called <varname>webtable</varname> that contains two column families named
- <varname>contents</varname> and <varname>anchor</varname>. In this example,
+ is a table called <varname>webtable</varname> that contains two rows
+ (<literal>com.cnn.www</literal>
+ and <literal>com.example.www</literal>), three column families named
+ <varname>contents</varname>, <varname>anchor</varname>, and <varname>people</varname>. In
+ this example, for the first row (<literal>com.cnn.www</literal>),
  <varname>anchor</varname> contains two columns (<varname>anchor:cssnsi.com</varname>, <varname>anchor:my.look.ca</varname>) and <varname>contents</varname> contains one column
- (<varname>contents:html</varname>). <note>
+ (<varname>contents:html</varname>). This example contains 5 versions of the row with the
+ row key <literal>com.cnn.www</literal>, and one version of the row with the row key
+ <literal>com.example.www</literal>. The <varname>contents:html</varname> column qualifier contains the entire
+ HTML of a given website. Qualifiers of the <varname>anchor</varname> column family each
+ contain the external site which links to the site represented by the row, along with the
+ text it used in the anchor of its link. The <varname>people</varname> column family represents
+ people associated with the site.
+ </para>
+ <note>
  <title>Column Names</title>
- <para> By convention, a column name is made of its column family prefix and a
- <emphasis>qualifier</emphasis>. For example, the column
- <emphasis>contents:html</emphasis> is made up of the column family
- <varname>contents</varname> and <varname>html</varname> qualifier. The colon character
- (<literal>:</literal>) delimits the column family from the column family
- <emphasis>qualifier</emphasis>. </para>
+ <para> By convention, a column name is made of its column family prefix and a
+ <emphasis>qualifier</emphasis>. For example, the column
+ <emphasis>contents:html</emphasis> is made up of the column family
+ <varname>contents</varname> and the <varname>html</varname> qualifier. The colon
+ character (<literal>:</literal>) delimits the column family from the column family
+ <emphasis>qualifier</emphasis>. </para>
  </note>
  <table frame="all">
  <title>Table <varname>webtable</varname></title>
  <tgroup
- cols="4"
+ cols="5"
  align="left"
  colsep="1"
  rowsep="1">
@@ -134,12 +225,15 @@
  colname="c3" />
  <colspec
  colname="c4" />
+ <colspec
+ colname="c5" />
  <thead>
  <row>
  <entry>Row Key</entry>
  <entry>Time Stamp</entry>
  <entry>ColumnFamily <varname>contents</varname></entry>
  <entry>ColumnFamily <varname>anchor</varname></entry>
+ <entry>ColumnFamily <varname>people</varname></entry>
  </row>
  </thead>
  <tbody>
@@ -148,128 +242,172 @@
  <entry>t9</entry>
  <entry />
  <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
+ <entry />
  </row>
  <row>
  <entry>"com.cnn.www"</entry>
  <entry>t8</entry>
  <entry />
  <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
+ <entry />
  </row>
  <row>
  <entry>"com.cnn.www"</entry>
  <entry>t6</entry>
  <entry><varname>contents:html</varname> = "<html>..."</entry>
  <entry />
+ <entry />
  </row>
  <row>
  <entry>"com.cnn.www"</entry>
  <entry>t5</entry>
  <entry><varname>contents:html</varname> = "<html>..."</entry>
  <entry />
+ <entry />
  </row>
  <row>
  <entry>"com.cnn.www"</entry>
  <entry>t3</entry>
  <entry><varname>contents:html</varname> = "<html>..."</entry>
  <entry />
+ <entry />
+ </row>
+ <row>
+ <entry>"com.example.www"</entry>
+ <entry>t5</entry>
+ <entry><varname>contents:html</varname> = "<html>..."</entry>
+ <entry></entry>
+ <entry>people:author = "John Doe"</entry>
  </row>
  </tbody>
  </tgroup>
  </table>
- </para>
+ <para>Cells in this table that appear to be empty do not take space, or in fact exist, in
+ HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to
+ look at data in HBase, or even the most accurate. The following represents the same
+ information as a multi-dimensional map. This is only a mock-up for illustrative
+ purposes and may not be strictly accurate.</para>
+ <programlisting><![CDATA[
+{
+  "com.cnn.www": {
+    contents: {
+      t6: contents:html: "<html>..."
+      t5: contents:html: "<html>..."
+      t3: contents:html: "<html>..."
+    }
+    anchor: {
+      t9: anchor:cnnsi.com = "CNN"
+      t8: anchor:my.look.ca = "CNN.com"
+    }
+    people: {}
+  }
+  "com.example.www": {
+    contents: {
+      t5: contents:html: "<html>..."
+    }
+    anchor: {}
+    people: {
+      t5: people:author: "John Doe"
+    }
+  }
+}
+ ]]></programlisting>
+ </section>
  <section xml:id="physical.view">
  <title>Physical View</title>
- <para> Although at a conceptual level tables may be viewed as a sparse set of rows. Physically
- they are stored on a per-column family basis. New columns (i.e.,
- <varname>columnfamily:column</varname>) can be added to any column family without
- pre-announcing them. <table
- frame="all">
- <title>ColumnFamily <varname>anchor</varname></title>
- <tgroup
- cols="3"
- align="left"
- colsep="1"
- rowsep="1">
- <colspec
- colname="c1" />
- <colspec
- colname="c2" />
- <colspec
- colname="c3" />
- <thead>
- <row>
- <entry>Row Key</entry>
- <entry>Time Stamp</entry>
- <entry>Column Family <varname>anchor</varname></entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>"com.cnn.www"</entry>
- <entry>t9</entry>
- <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
- </row>
- <row>
- <entry>"com.cnn.www"</entry>
- <entry>t8</entry>
- <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
- </row>
- </tbody>
- </tgroup>
- </table>
- <table
- frame="all">
- <title>ColumnFamily <varname>contents</varname></title>
- <tgroup
- cols="3"
- align="left"
- colsep="1"
- rowsep="1">
- <colspec
- colname="c1" />
- <colspec
- colname="c2" />
- <colspec
- colname="c3" />
- <thead>
- <row>
- <entry>Row Key</entry>
- <entry>Time Stamp</entry>
- <entry>ColumnFamily "contents:"</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>"com.cnn.www"</entry>
- <entry>t6</entry>
- <entry><varname>contents:html</varname> = "<html>..."</entry>
- </row>
- <row>
- <entry>"com.cnn.www"</entry>
- <entry>t5</entry>
- <entry><varname>contents:html</varname> = "<html>..."</entry>
- </row>
- <row>
- <entry>"com.cnn.www"</entry>
- <entry>t3</entry>
- <entry><varname>contents:html</varname> = "<html>..."</entry>
- </row>
- </tbody>
- </tgroup>
- </table> It is important to note in the diagram above that the empty cells shown in the
- conceptual view are not stored since they need not be in a column-oriented storage format.
+ <para> Although at a conceptual level tables may be viewed as a sparse set of rows, they are
+ physically stored by column family. A new column qualifier (column_family:column_qualifier)
+ can be added to an existing column family at any time.</para>
+ <table
+ frame="all">
+ <title>ColumnFamily <varname>anchor</varname></title>
+ <tgroup
+ cols="3"
+ align="left"
+ colsep="1"
+ rowsep="1">
+ <colspec
+ colname="c1" />
+ <colspec
+ colname="c2" />
+ <colspec
+ colname="c3" />
+ <thead>
+ <row>
+ <entry>Row Key</entry>
+ <entry>Time Stamp</entry>
+ <entry>Column Family <varname>anchor</varname></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>"com.cnn.www"</entry>
+ <entry>t9</entry>
+ <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
+ </row>
+ <row>
+ <entry>"com.cnn.www"</entry>
+ <entry>t8</entry>
+ <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ <table
+ frame="all">
+ <title>ColumnFamily <varname>contents</varname></title>
+ <tgroup
+ cols="3"
+ align="left"
+ colsep="1"
+ rowsep="1">
+ <colspec
+ colname="c1" />
+ <colspec
+ colname="c2" />
+ <colspec
+ colname="c3" />
+ <thead>
+ <row>
+ <entry>Row Key</entry>
+ <entry>Time Stamp</entry>
+ <entry>ColumnFamily "contents:"</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>"com.cnn.www"</entry>
+ <entry>t6</entry>
+ <entry><varname>contents:html</varname> = "<html>..."</entry>
+ </row>
+ <row>
+ <entry>"com.cnn.www"</entry>
+ <entry>t5</entry>
+ <entry><varname>contents:html</varname> = "<html>..."</entry>
+ </row>
+ <row>
+ <entry>"com.cnn.www"</entry>
+ <entry>t3</entry>
+ <entry><varname>contents:html</varname> = "<html>..."</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ <para>The empty cells shown in the
+ conceptual view are not stored at all.
  Thus a request for the value of the <varname>contents:html</varname> column at time stamp <literal>t8</literal> would return no value. Similarly, a request for an <varname>anchor:my.look.ca</varname> value at time stamp <literal>t9</literal> would return no value. However, if no timestamp is supplied, the most recent value for a
- particular column would be returned and would also be the first one found since timestamps
+ particular column would be returned. Given multiple versions, the most recent is also the
+ first one found, since timestamps
  are stored in descending order. Thus a request for the values of all columns in the row <varname>com.cnn.www</varname> if no timestamp is specified would be: the value of
- <varname>contents:html</varname> from time stamp <literal>t6</literal>, the value of
- <varname>anchor:cnnsi.com</varname> from time stamp <literal>t9</literal>, the value of
- <varname>anchor:my.look.ca</varname> from time stamp <literal>t8</literal>. </para>
+ <varname>contents:html</varname> from timestamp <literal>t6</literal>, the value of
+ <varname>anchor:cnnsi.com</varname> from timestamp <literal>t9</literal>, the value of
+ <varname>anchor:my.look.ca</varname> from timestamp <literal>t8</literal>. </para>
  <para>For more information about the internals of how Apache HBase stores data, see <xref linkend="regions.arch" />. </para>
  </section>
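
Reviewer's aside, not part of the commit: the lookup behaviour described in the Physical View text can be sketched with plain Python dictionaries. This is only an illustrative model under stated assumptions. It is not the HBase client API, the helper names `get` and `get_row` are hypothetical, and the timestamps t3 to t9 are represented as the integers 3 to 9.

```python
# Illustrative model only: nested dicts stand in for HBase storage.
# Assumption: timestamps t3..t9 from the example are the integers 3..9.
webtable = {
    "com.cnn.www": {
        "contents:html": {6: "<html>...", 5: "<html>...", 3: "<html>..."},
        "anchor:cnnsi.com": {9: "CNN"},
        "anchor:my.look.ca": {8: "CNN.com"},
    },
    "com.example.www": {
        "contents:html": {5: "<html>..."},
        "people:author": {5: "John Doe"},
    },
}


def get(table, row, column, timestamp=None):
    """Hypothetical helper: return (timestamp, value) for one cell, or None.

    With no timestamp, the newest stored version is returned. With a
    timestamp, only a version written at exactly that timestamp matches,
    because cells that look empty in the tabular view are simply absent.
    """
    versions = table.get(row, {}).get(column, {})
    if not versions:
        return None
    if timestamp is None:
        newest = max(versions)
        return newest, versions[newest]
    if timestamp in versions:
        return timestamp, versions[timestamp]
    return None


def get_row(table, row):
    """Hypothetical helper: newest version of every column stored in a row."""
    return {column: get(table, row, column) for column in sorted(table.get(row, {}))}
```

In this model, `get(webtable, "com.cnn.www", "contents:html", 8)` finds nothing because no version was written at t8, while `get_row(webtable, "com.cnn.www")` picks t6, t9, and t8 for the three stored columns, matching the prose in the patch.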
