http://git-wip-us.apache.org/repos/asf/hbase/blob/cb77a925/src/main/docbkx/rpc.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/rpc.xml b/src/main/docbkx/rpc.xml deleted file mode 100644 index 2e5dd5f..0000000 --- a/src/main/docbkx/rpc.xml +++ /dev/null @@ -1,301 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<appendix - xml:id="hbase.rpc" - version="5.0" - xmlns="http://docbook.org/ns/docbook" - xmlns:xlink="http://www.w3.org/1999/xlink" - xmlns:xi="http://www.w3.org/2001/XInclude" - xmlns:svg="http://www.w3.org/2000/svg" - xmlns:m="http://www.w3.org/1998/Math/MathML" - xmlns:html="http://www.w3.org/1999/xhtml" - xmlns:db="http://docbook.org/ns/docbook"> - <!--/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ ---> - - <title>0.95 RPC Specification</title> - <para>In 0.95, all client/server communication is done with <link - xlink:href="https://code.google.com/p/protobuf/">protobufâed</link> Messages rather than - with <link - xlink:href="http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Writable.html">Hadoop - Writables</link>. Our RPC wire format therefore changes. This document describes the - client/server request/response protocol and our new RPC wire-format.</para> - <para /> - <para>For what RPC is like in 0.94 and previous, see Benoît/Tsunaâs <link - xlink:href="https://github.com/OpenTSDB/asynchbase/blob/master/src/HBaseRpc.java#L164">Unofficial - Hadoop / HBase RPC protocol documentation</link>. For more background on how we arrived - at this spec., see <link - xlink:href="https://docs.google.com/document/d/1WCKwgaLDqBw2vpux0jPsAu2WPTRISob7HGCO8YhfDTA/edit#">HBase - RPC: WIP</link></para> - <para /> - <section> - <title>Goals</title> - <para> - <orderedlist> - <listitem> - <para>A wire-format we can evolve</para> - </listitem> - <listitem> - <para>A format that does not require our rewriting server core or radically - changing its current architecture (for later).</para> - </listitem> - </orderedlist> - </para> - </section> - <section> - <title>TODO</title> - <para> - <orderedlist> - <listitem> - <para>List of problems with currently specified format and where we would like - to go in a version2, etc. For example, what would we have to change if - anything to move server async or to support streaming/chunking?</para> - </listitem> - <listitem> - <para>Diagram on how it works</para> - </listitem> - <listitem> - <para>A grammar that succinctly describes the wire-format. Currently we have - these words and the content of the rpc protobuf idl but a grammar for the - back and forth would help with groking rpc. 
Also, a little state machine on - client/server interactions would help with understanding (and ensuring - correct implementation).</para> - </listitem> - </orderedlist> - </para> - </section> - <section> - <title>RPC</title> - <para>The client will send setup information on connection establish. Thereafter, the client - invokes methods against the remote server sending a protobuf Message and receiving a - protobuf Message in response. Communication is synchronous. All back and forth is - preceded by an int that has the total length of the request/response. Optionally, - Cells(KeyValues) can be passed outside of protobufs in follow-behind Cell blocks - (because <link - xlink:href="https://docs.google.com/document/d/1WEtrq-JTIUhlnlnvA0oYRLp0F8MKpEBeBSCFcQiacdw/edit#">we - canât protobuf megabytes of KeyValues</link> or Cells). These CellBlocks are encoded - and optionally compressed.</para> - <para /> - <para>For more detail on the protobufs involved, see the <link - xlink:href="http://svn.apache.org/viewvc/hbase/trunk/hbase-protocol/src/main/protobuf/RPC.proto?view=markup">RPC.proto</link> - file in trunk.</para> - - <section> - <title>Connection Setup</title> - <para>Client initiates connection.</para> - <section> - <title>Client</title> - <para>On connection setup, client sends a preamble followed by a connection header. </para> - - <section> - <title><preamble></title> - <programlisting><MAGIC 4 byte integer> <1 byte RPC Format Version> <1 byte auth type></programlisting> - <para> We need the auth method spec. here so the connection header is encoded if auth enabled.</para> - <para>E.g.: HBas0x000x50 -- 4 bytes of MAGIC -- âHBasâ -- plus one-byte of - version, 0 in this case, and one byte, 0x50 (SIMPLE). of an auth - type.</para> - </section> - - <section> - <title><Protobuf ConnectionHeader Message></title> - <para>Has user info, and âprotocolâ, as well as the encoders and compression the - client will use sending CellBlocks. CellBlock encoders and compressors are - for the life of the connection. CellBlock encoders implement - org.apache.hadoop.hbase.codec.Codec. CellBlocks may then also be compressed. - Compressors implement org.apache.hadoop.io.compress.CompressionCodec. This - protobuf is written using writeDelimited so is prefaced by a pb varint with - its serialized length</para> - </section> - </section> - <!--Client--> - - <section> - <title>Server</title> - <para>After client sends preamble and connection header, server does NOT respond if - successful connection setup. No response means server is READY to accept - requests and to give out response. If the version or authentication in the - preamble is not agreeable or the server has trouble parsing the preamble, it - will throw a org.apache.hadoop.hbase.ipc.FatalConnectionException explaining the - error and will then disconnect. If the client in the connection header -- i.e. - the protobufâd Message that comes after the connection preamble -- asks for for - a Service the server does not support or a codec the server does not have, again - we throw a FatalConnectionException with explanation.</para> - </section> - </section> - - <section> - <title>Request</title> - <para>After a Connection has been set up, client makes requests. Server responds.</para> - <para>A request is made up of a protobuf RequestHeader followed by a protobuf Message - parameter. The header includes the method name and optionally, metadata on the - optional CellBlock that may be following. The parameter type suits the method being - invoked: i.e. 
if we are doing a getRegionInfo request, the protobuf Message param - will be an instance of GetRegionInfoRequest. The response will be a - GetRegionInfoResponse. The CellBlock is optionally used ferrying the bulk of the RPC - data: i.e Cells/KeyValues.</para> - <section> - <title>Request Parts</title> - <section> - <title><Total Length></title> - <para>The request is prefaced by an int that holds the total length of what - follows.</para> - </section> - <section> - <title><Protobuf RequestHeader Message></title> - <para>Will have call.id, trace.id, and method name, etc. including optional - Metadata on the Cell block IFF one is following. Data is protobufâd inline - in this pb Message or optionally comes in the following CellBlock</para> - </section> - <section> - <title><Protobuf Param Message></title> - <para>If the method being invoked is getRegionInfo, if you study the Service - descriptor for the client to regionserver protocol, you will find that the - request sends a GetRegionInfoRequest protobuf Message param in this - position.</para> - </section> - <section> - <title><CellBlock></title> - <para>An encoded and optionally compressed Cell block.</para> - </section> - </section> - <!--Request parts--> - </section> - <!--Request--> - - <section> - <title>Response</title> - <para>Same as Request, it is a protobuf ResponseHeader followed by a protobuf Message - response where the Message response type suits the method invoked. Bulk of the data - may come in a following CellBlock.</para> - <section> - <title>Response Parts</title> - <section> - <title><Total Length></title> - <para>The response is prefaced by an int that holds the total length of what - follows.</para> - </section> - <section> - <title><Protobuf ResponseHeader Message></title> - <para>Will have call.id, etc. Will include exception if failed processing. -  Optionally includes metadata on optional, IFF there is a CellBlock - following.</para> - </section> - - <section> - <title><Protobuf Response Message></title> - <para>Return or may be nothing if exception. If the method being invoked is - getRegionInfo, if you study the Service descriptor for the client to - regionserver protocol, you will find that the response sends a - GetRegionInfoResponse protobuf Message param in this position.</para> - </section> - <section> - <title><CellBlock></title> - <para>An encoded and optionally compressed Cell block.</para> - </section> - </section> - <!--Parts--> - </section> - <!--Response--> - - <section> - <title>Exceptions</title> - <para>There are two distinct types. There is the request failed which is encapsulated - inside the response header for the response. The connection stays open to receive - new requests. The second type, the FatalConnectionException, kills the - connection.</para> - <para>Exceptions can carry extra information. See the ExceptionResponse protobuf type. - It has a flag to indicate do-no-retry as well as other miscellaneous payload to help - improve client responsiveness.</para> - </section> - <section> - <title>CellBlocks</title> - <para>These are not versioned. Server can do the codec or it cannot. If new version of a - codec with say, tighter encoding, then give it a new class name. Codecs will live on - the server for all time so old clients can connect.</para> - </section> - </section> - - - <section> - <title>Notes</title> - <section> - <title>Constraints</title> - <para>In some part, current wire-format -- i.e. 
all requests and responses preceeded by - a length -- has been dictated by current server non-async architecture.</para> - </section> - <section> - <title>One fat pb request or header+param</title> - <para>We went with pb header followed by pb param making a request and a pb header - followed by pb response for now. Doing header+param rather than a single protobuf - Message with both header and param content:</para> - <para> - <orderedlist> - <listitem> - <para>Is closer to what we currently have</para> - </listitem> - <listitem> - <para>Having a single fat pb requires extra copying putting the already pbâd - param into the body of the fat request pb (and same making - result)</para> - </listitem> - <listitem> - <para>We can decide whether to accept the request or not before we read the - param; for example, the request might be low priority.  As is, we read - header+param in one go as server is currently implemented so this is a - TODO.</para> - </listitem> - </orderedlist> - </para> - <para>The advantages are minor.  If later, fat request has clear advantage, can roll out - a v2 later.</para> - </section> - <section - xml:id="rpc.configs"> - <title>RPC Configurations</title> - <section> - <title>CellBlock Codecs</title> - <para>To enable a codec other than the default <classname>KeyValueCodec</classname>, - set <varname>hbase.client.rpc.codec</varname> to the name of the Codec class to - use. Codec must implement hbase's <classname>Codec</classname> Interface. After - connection setup, all passed cellblocks will be sent with this codec. The server - will return cellblocks using this same codec as long as the codec is on the - servers' CLASSPATH (else you will get - <classname>UnsupportedCellCodecException</classname>).</para> - <para>To change the default codec, set - <varname>hbase.client.default.rpc.codec</varname>. </para> - <para>To disable cellblocks completely and to go pure protobuf, set the default to - the empty String and do not specify a codec in your Configuration. So, set - <varname>hbase.client.default.rpc.codec</varname> to the empty string and do - not set <varname>hbase.client.rpc.codec</varname>. This will cause the client to - connect to the server with no codec specified. If a server sees no codec, it - will return all responses in pure protobuf. Running pure protobuf all the time - will be slower than running with cellblocks. </para> - </section> - <section> - <title>Compression</title> - <para>Uses hadoops compression codecs. To enable compressing of passed CellBlocks, - set <varname>hbase.client.rpc.compressor</varname> to the name of the Compressor - to use. Compressor must implement Hadoops' CompressionCodec Interface. After - connection setup, all passed cellblocks will be sent compressed. The server will - return cellblocks compressed using this same compressor as long as the - compressor is on its CLASSPATH (else you will get - <classname>UnsupportedCompressionCodecException</classname>).</para> - </section> - </section> - </section> -</appendix>
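As a rough client-side illustration of the two settings just described (the codec and compressor class names are examples only; any Codec / CompressionCodec implementation present on both the client and server CLASSPATH can be named):
<programlisting language="java">
Configuration conf = HBaseConfiguration.create();
// CellBlock codec for this client's connections (KeyValueCodec is the default).
conf.set("hbase.client.rpc.codec", "org.apache.hadoop.hbase.codec.KeyValueCodecWithTags");
// Optionally compress CellBlocks with one of Hadoop's compression codecs.
conf.set("hbase.client.rpc.compressor", "org.apache.hadoop.io.compress.GzipCodec");
// Codec and compressor are fixed for the life of each connection at setup time.
HTable table = new HTable(conf, "myTable");
</programlisting>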
http://git-wip-us.apache.org/repos/asf/hbase/blob/cb77a925/src/main/docbkx/schema_design.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/schema_design.xml b/src/main/docbkx/schema_design.xml deleted file mode 100644 index e4632ec..0000000 --- a/src/main/docbkx/schema_design.xml +++ /dev/null @@ -1,1262 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<chapter - version="5.0" - xml:id="schema" - xmlns="http://docbook.org/ns/docbook" - xmlns:xlink="http://www.w3.org/1999/xlink" - xmlns:xi="http://www.w3.org/2001/XInclude" - xmlns:svg="http://www.w3.org/2000/svg" - xmlns:m="http://www.w3.org/1998/Math/MathML" - xmlns:html="http://www.w3.org/1999/xhtml" - xmlns:db="http://docbook.org/ns/docbook"> - <!--
-/**
- *
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
--->
- <title>HBase and Schema Design</title>
- <para>A good general introduction to the strengths and weaknesses of data modelling in the various
- non-RDBMS datastores is Ian Varley's Master's thesis, <link
- xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation:
- The Mixed Blessings of Non-Relational Databases</link>. Recommended. Also, read <xref
- linkend="keyvalue" /> for how HBase stores data internally, and the section on <xref
- linkend="schema.casestudies" />. </para>
- <section
- xml:id="schema.creation">
- <title> Schema Creation </title>
- <para>HBase schemas can be created or updated with <xref
- linkend="shell" /> or by using <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html">HBaseAdmin</link>
- in the Java API. </para>
- <para>Tables must be disabled when making ColumnFamily modifications, for example:</para>
- <programlisting language="java">
-Configuration config = HBaseConfiguration.create();
-HBaseAdmin admin = new HBaseAdmin(config);
-String table = "myTable";
-
-admin.disableTable(table);
-
-HColumnDescriptor cf1 = ...;
-admin.addColumn(table, cf1); // adding new ColumnFamily
-HColumnDescriptor cf2 = ...;
-admin.modifyColumn(table, cf2); // modifying existing ColumnFamily
-
-admin.enableTable(table);
- </programlisting>
- <para>See <xref
- linkend="client_dependencies" /> for more information about configuring client
- connections.</para>
- <para>Note: online schema changes are supported in the 0.92.x codebase, but the 0.90.x codebase
- requires the table to be disabled. </para>
- <section
- xml:id="schema.updates">
- <title>Schema Updates</title>
- <para>When changes are made to either Tables or ColumnFamilies (e.g., region size, block
- size), these changes take effect the next time there is a major compaction and the
- StoreFiles get re-written. </para>
- <para>See <xref
- linkend="store" /> for more information on StoreFiles.
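A follow-up sketch: if you want the new settings applied to existing data right away rather than at the next natural major compaction, you can request one explicitly (reusing the <code>admin</code> and <code>table</code> from the example above; a major compaction can be an expensive operation):
<programlisting language="java">
// Force a re-write of the existing StoreFiles so that the modified
// ColumnFamily settings take effect immediately.
admin.majorCompact(table);
</programlisting>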
</para> - </section> - </section> - <section - xml:id="number.of.cfs"> - <title> On the number of column families </title> - <para> HBase currently does not do well with anything above two or three column families so keep - the number of column families in your schema low. Currently, flushing and compactions are done - on a per Region basis so if one column family is carrying the bulk of the data bringing on - flushes, the adjacent families will also be flushed though the amount of data they carry is - small. When many column families the flushing and compaction interaction can make for a bunch - of needless i/o loading (To be addressed by changing flushing and compaction to work on a per - column family basis). For more information on compactions, see <xref - linkend="compaction" />. </para> - <para>Try to make do with one column family if you can in your schemas. Only introduce a second - and third column family in the case where data access is usually column scoped; i.e. you query - one column family or the other but usually not both at the one time. </para> - <section - xml:id="number.of.cfs.card"> - <title>Cardinality of ColumnFamilies</title> - <para>Where multiple ColumnFamilies exist in a single table, be aware of the cardinality - (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion - rows, ColumnFamilyA's data will likely be spread across many, many regions (and - RegionServers). This makes mass scans for ColumnFamilyA less efficient. </para> - </section> - </section> - <section - xml:id="rowkey.design"> - <title>Rowkey Design</title> - <section> - <title>Hotspotting</title> - <para>Rows in HBase are sorted lexicographically by row key. This design optimizes for scans, - allowing you to store related rows, or rows that will be read together, near each other. - However, poorly designed row keys are a common source of <firstterm>hotspotting</firstterm>. - Hotspotting occurs when a large amount of client traffic is directed at one node, or only a - few nodes, of a cluster. This traffic may represent reads, writes, or other operations. The - traffic overwhelms the single machine responsible for hosting that region, causing - performance degradation and potentially leading to region unavailability. This can also have - adverse effects on other regions hosted by the same region server as that host is unable to - service the requested load. It is important to design data access patterns such that the - cluster is fully and evenly utilized.</para> - <para>To prevent hotspotting on writes, design your row keys such that rows that truly do need - to be in the same region are, but in the bigger picture, data is being written to multiple - regions across the cluster, rather than one at a time. Some common techniques for avoiding - hotspotting are described below, along with some of their advantages and drawbacks.</para> - <formalpara> - <title>Salting</title> - <para>Salting in this sense has nothing to do with cryptography, but refers to adding random - data to the start of a row key. In this case, salting refers to adding a randomly-assigned - prefix to the row key to cause it to sort differently than it otherwise would. The number - of possible prefixes correspond to the number of regions you want to spread the data - across. Salting can be helpful if you have a few "hot" row key patterns which come up over - and over amongst other more evenly-distributed rows. 
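A minimal sketch of applying such a prefix on write (the four salt values and the row key here are illustrative only):
<programlisting language="java">
// Illustrative only: prepend one of four salt prefixes, chosen at random,
// so that writes spread across roughly four regions.
String[] salts = { "a", "b", "c", "d" };
String rowKey = "foo0003";
String saltedKey = salts[new java.util.Random().nextInt(salts.length)] + "-" + rowKey;
Put put = new Put(Bytes.toBytes(saltedKey));
</programlisting>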
Consider the following example, which - shows that salting can spread write load across multiple regionservers, and illustrates - some of the negative implications for reads.</para> - </formalpara> - <example> - <title>Salting Example</title> - <para>Suppose you have the following list of row keys, and your table is split such that - there is one region for each letter of the alphabet. Prefix 'a' is one region, prefix 'b' - is another. In this table, all rows starting with 'f' are in the same region. This example - focuses on rows with keys like the following:</para> - <screen> -foo0001 -foo0002 -foo0003 -foo0004 - </screen> - <para>Now, imagine that you would like to spread these across four different regions. You - decide to use four different salts: <literal>a</literal>, <literal>b</literal>, - <literal>c</literal>, and <literal>d</literal>. In this scenario, each of these letter - prefixes will be on a different region. After applying the salts, you have the following - rowkeys instead. Since you can now write to four separate regions, you theoretically have - four times the throughput when writing that you would have if all the writes were going to - the same region.</para> - <screen> -a-foo0003 -b-foo0001 -c-foo0004 -d-foo0002 - </screen> - <para>Then, if you add another row, it will randomly be assigned one of the four possible - salt values and end up near one of the existing rows.</para> - <screen> -a-foo0003 -b-foo0001 -<emphasis>c-foo0003</emphasis> -c-foo0004 -d-foo0002 - </screen> - <para>Since this assignment will be random, you will need to do more work if you want to - retrieve the rows in lexicographic order. In this way, salting attempts to increase - throughput on writes, but has a cost during reads.</para> - </example> - <para></para> - <formalpara> - <title>Hashing</title> - <para>Instead of a random assignment, you could use a one-way <firstterm>hash</firstterm> - that would cause a given row to always be "salted" with the same prefix, in a way that - would spread the load across the regionservers, but allow for predictability during reads. - Using a deterministic hash allows the client to reconstruct the complete rowkey and use a - Get operation to retrieve that row as normal.</para> - </formalpara> - <example> - <title>Hashing Example</title> - <para>Given the same situation in the salting example above, you could instead apply a - one-way hash that would cause the row with key <literal>foo0003</literal> to always, and - predictably, receive the <literal>a</literal> prefix. Then, to retrieve that row, you - would already know the key. You could also optimize things so that certain pairs of keys - were always in the same region, for instance.</para> - </example> - <formalpara> - <title>Reversing the Key</title> - <para>A third common trick for preventing hotspotting is to reverse a fixed-width or numeric - row key so that the part that changes the most often (the least significant digit) is first. 
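For example (a sketch, assuming a fixed-width numeric key stored as a string):
<programlisting language="java">
// Illustrative key reversal: "20150407123000" becomes "00032170405102",
// so the fastest-changing digits lead the key.
String rowKey = "20150407123000";
String reversedKey = new StringBuilder(rowKey).reverse().toString();
</programlisting>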
- This effectively randomizes row keys, but sacrifices row ordering properties.</para> - </formalpara> - <para>See <link - xlink:href="https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables" - >https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables</link>, - and <link xlink:href="http://phoenix.apache.org/salted.html">article on Salted Tables</link> - from the Phoenix project, and the discussion in the comments of <link - xlink:href="https://issues.apache.org/jira/browse/HBASE-11682">HBASE-11682</link> for more - information about avoiding hotspotting.</para> - </section> - <section - xml:id="timeseries"> - <title> Monotonically Increasing Row Keys/Timeseries Data </title> - <para> In the HBase chapter of Tom White's book <link - xlink:href="http://oreilly.com/catalog/9780596521981">Hadoop: The Definitive Guide</link> - (O'Reilly) there is a an optimization note on watching out for a phenomenon where an import - process walks in lock-step with all clients in concert pounding one of the table's regions - (and thus, a single node), then moving onto the next region, etc. With monotonically - increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan - on why monotonically increasing row keys are problematic in BigTable-like datastores: <link - xlink:href="http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/">monotonically - increasing values are bad</link>. The pile-up on a single region brought on by - monotonically increasing keys can be mitigated by randomizing the input records to not be in - sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2, - 3) as the row-key. </para> - <para>If you do need to upload time series data into HBase, you should study <link - xlink:href="http://opentsdb.net/">OpenTSDB</link> as a successful example. It has a page - describing the <link - xlink:href=" http://opentsdb.net/schema.html">schema</link> it uses in HBase. The key - format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at - first glance to contradict the previous advice about not using a timestamp as the key. - However, the difference is that the timestamp is not in the <emphasis>lead</emphasis> - position of the key, and the design assumption is that there are dozens or hundreds (or - more) of different metric types. Thus, even with a continual stream of input data with a mix - of metric types, the Puts are distributed across various points of regions in the table. </para> - <para>See <xref - linkend="schema.casestudies" /> for some rowkey design examples. </para> - </section> - <section - xml:id="keysize"> - <title>Try to minimize row and column sizes</title> - <subtitle>Or why are my StoreFile indices large?</subtitle> - <para>In HBase, values are always freighted with their coordinates; as a cell value passes - through the system, it'll be accompanied by its row, column name, and timestamp - always. If - your rows and column names are large, especially compared to the size of the cell value, - then you may run up against some interesting scenarios. One such is the case described by - Marc Limotte at the tail of <link - xlink:href="https://issues.apache.org/jira/browse/HBASE-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005272#comment-13005272">HBASE-3551</link> - (recommended!). 
Therein, the indices that are kept on HBase storefiles (<xref - linkend="hfile" />) to facilitate random access may end up occupyng large chunks of the - HBase allotted RAM because the cell value coordinates are large. Mark in the above cited - comment suggests upping the block size so entries in the store file index happen at a larger - interval or modify the table schema so it makes for smaller rows and column names. - Compression will also make for larger indices. See the thread <link - xlink:href="http://search-hadoop.com/m/hemBv1LiN4Q1/a+question+storefileIndexSize&subj=a+question+storefileIndexSize">a - question storefileIndexSize</link> up on the user mailing list. </para> - <para>Most of the time small inefficiencies don't matter all that much. Unfortunately, this is - a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and - rowkeys they could be repeated several billion times in your data. </para> - <para>See <xref - linkend="keyvalue" /> for more information on HBase stores data internally to see why this - is important.</para> - <section - xml:id="keysize.cf"> - <title>Column Families</title> - <para>Try to keep the ColumnFamily names as small as possible, preferably one character - (e.g. "d" for data/default). </para> - <para>See <xref - linkend="keyvalue" /> for more information on HBase stores data internally to see why - this is important.</para> - </section> - <section - xml:id="keysize.attributes"> - <title>Attributes</title> - <para>Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to - read, prefer shorter attribute names (e.g., "via") to store in HBase. </para> - <para>See <xref - linkend="keyvalue" /> for more information on HBase stores data internally to see why - this is important.</para> - </section> - <section - xml:id="keysize.row"> - <title>Rowkey Length</title> - <para>Keep them as short as is reasonable such that they can still be useful for required - data access (e.g., Get vs. Scan). A short key that is useless for data access is not - better than a longer key with better get/scan properties. Expect tradeoffs when designing - rowkeys. </para> - </section> - <section - xml:id="keysize.patterns"> - <title>Byte Patterns</title> - <para>A long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 - in those eight bytes. If you stored this number as a String -- presuming a byte per - character -- you need nearly 3x the bytes. </para> - <para>Not convinced? Below is some sample code that you can run on your own.</para> - <programlisting language="java"> -// long -// -long l = 1234567890L; -byte[] lb = Bytes.toBytes(l); -System.out.println("long bytes length: " + lb.length); // returns 8 - -String s = "" + l; -byte[] sb = Bytes.toBytes(s); -System.out.println("long as string length: " + sb.length); // returns 10 - -// hash -// -MessageDigest md = MessageDigest.getInstance("MD5"); -byte[] digest = md.digest(Bytes.toBytes(s)); -System.out.println("md5 digest bytes length: " + digest.length); // returns 16 - -String sDigest = new String(digest); -byte[] sbDigest = Bytes.toBytes(sDigest); -System.out.println("md5 digest as string length: " + sbDigest.length); // returns 26 - </programlisting> - <para>Unfortunately, using a binary representation of a type will make your data harder to - read outside of your code. 
For example, this is what you will see in the shell when you - increment a value:</para> - <programlisting> -hbase(main):001:0> incr 't', 'r', 'f:q', 1 -COUNTER VALUE = 1 - -hbase(main):002:0> get 't', 'r' -COLUMN CELL - f:q timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01 -1 row(s) in 0.0310 seconds - </programlisting> - <para>The shell makes a best effort to print a string, and it this case it decided to just - print the hex. The same will happen to your row keys inside the region names. It can be - okay if you know what's being stored, but it might also be unreadable if arbitrary data - can be put in the same cells. This is the main trade-off. </para> - </section> - - </section> - <section - xml:id="reverse.timestamp"> - <title>Reverse Timestamps</title> - <note> - <title>Reverse Scan API</title> - <para> - <link - xlink:href="https://issues.apache.org/jira/browse/HBASE-4811">HBASE-4811</link> - implements an API to scan a table or a range within a table in reverse, reducing the need - to optimize your schema for forward or reverse scanning. This feature is available in - HBase 0.98 and later. See <link - xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setReversed%28boolean" /> - for more information. </para> - </note> - - <para>A common problem in database processing is quickly finding the most recent version of a - value. A technique using reverse timestamps as a part of the key can help greatly with a - special case of this problem. Also found in the HBase chapter of Tom White's book Hadoop: - The Definitive Guide (O'Reilly), the technique involves appending (<code>Long.MAX_VALUE - - timestamp</code>) to the end of any key, e.g., [key][reverse_timestamp]. </para> - <para>The most recent value for [key] in a table can be found by performing a Scan for [key] - and obtaining the first record. Since HBase keys are in sorted order, this key sorts before - any older row-keys for [key] and thus is first. </para> - <para>This technique would be used instead of using <xref - linkend="schema.versions" /> where the intent is to hold onto all versions "forever" (or a - very long time) and at the same time quickly obtain access to any other version by using the - same Scan technique. </para> - </section> - <section - xml:id="rowkey.scope"> - <title>Rowkeys and ColumnFamilies</title> - <para>Rowkeys are scoped to ColumnFamilies. Thus, the same rowkey could exist in each - ColumnFamily that exists in a table without collision. </para> - </section> - <section - xml:id="changing.rowkeys"> - <title>Immutability of Rowkeys</title> - <para>Rowkeys cannot be changed. The only way they can be "changed" in a table is if the row - is deleted and then re-inserted. This is a fairly common question on the HBase dist-list so - it pays to get the rowkeys right the first time (and/or before you've inserted a lot of - data). </para> - </section> - <section - xml:id="rowkey.regionsplits"> - <title>Relationship Between RowKeys and Region Splits</title> - <para>If you pre-split your table, it is <emphasis>critical</emphasis> to understand how your - rowkey will be distributed across the region boundaries. As an example of why this is - important, consider the example of using displayable hex characters as the lead position of - the key (e.g., "0000000000000000" to "ffffffffffffffff"). 
Running those key ranges through - <code>Bytes.split</code> (which is the split strategy used when creating regions in - <code>HBaseAdmin.createTable(byte[] startKey, byte[] endKey, numRegions)</code> for 10 - regions will generate the following splits...</para> - <screen> -48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 // 0 -54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 // 6 -61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68 // = -68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126 // D -75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72 // K -82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14 // R -88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44 // X -95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102 // _ -102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 // f - </screen> - <para>... (note: the lead byte is listed to the right as a comment.) Given that the first - split is a '0' and the last split is an 'f', everything is great, right? Not so fast. </para> - <para>The problem is that all the data is going to pile up in the first 2 regions and the last - region thus creating a "lumpy" (and possibly "hot") region problem. To understand why, refer - to an <link - xlink:href="http://www.asciitable.com">ASCII Table</link>. '0' is byte 48, and 'f' is byte - 102, but there is a huge gap in byte values (bytes 58 to 96) that will <emphasis>never - appear in this keyspace</emphasis> because the only values are [0-9] and [a-f]. Thus, the - middle regions regions will never be used. To make pre-spliting work with this example - keyspace, a custom definition of splits (i.e., and not relying on the built-in split method) - is required. </para> - <para>Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split - them in such a way that all the regions are accessible in the keyspace. While this example - demonstrated the problem with a hex-key keyspace, the same problem can happen with - <emphasis>any</emphasis> keyspace. Know your data. </para> - <para>Lesson #2: While generally not advisable, using hex-keys (and more generally, - displayable data) can still work with pre-split tables as long as all the created regions - are accessible in the keyspace. </para> - <para>To conclude this example, the following is an example of how appropriate splits can be - pre-created for hex-keys:. </para> - <programlisting language="java"><![CDATA[public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits) -throws IOException { - try { - admin.createTable( table, splits ); - return true; - } catch (TableExistsException e) { - logger.info("table " + table.getNameAsString() + " already exists"); - // the table already exists... 
- return false; - } -} - -public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) { - byte[][] splits = new byte[numRegions-1][]; - BigInteger lowestKey = new BigInteger(startKey, 16); - BigInteger highestKey = new BigInteger(endKey, 16); - BigInteger range = highestKey.subtract(lowestKey); - BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions)); - lowestKey = lowestKey.add(regionIncrement); - for(int i=0; i < numRegions-1;i++) { - BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i))); - byte[] b = String.format("%016x", key).getBytes(); - splits[i] = b; - } - return splits; -}]]></programlisting> - </section> - </section> - <!-- rowkey design --> - <section - xml:id="schema.versions"> - <title> Number of Versions </title> - <section - xml:id="schema.versions.max"> - <title>Maximum Number of Versions</title> - <para>The maximum number of row versions to store is configured per column family via <link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>. - The default for max versions is 1. This is an important parameter because as described in <xref - linkend="datamodel" /> section HBase does <emphasis>not</emphasis> overwrite row values, - but rather stores different values per row by time (and qualifier). Excess versions are - removed during major compactions. The number of max versions may need to be increased or - decreased depending on application needs. </para> - <para>It is not recommended setting the number of max versions to an exceedingly high level - (e.g., hundreds or more) unless those old values are very dear to you because this will - greatly increase StoreFile size. </para> - </section> - <section - xml:id="schema.minversions"> - <title> Minimum Number of Versions </title> - <para>Like maximum number of row versions, the minimum number of row versions to keep is - configured per column family via <link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>. - The default for min versions is 0, which means the feature is disabled. The minimum number - of row versions parameter is used together with the time-to-live parameter and can be - combined with the number of row versions parameter to allow configurations such as "keep the - last T minutes worth of data, at most N versions, <emphasis>but keep at least M versions - around</emphasis>" (where M is the value for minimum number of row versions, M<N). This - parameter should only be set when time-to-live is enabled for a column family and must be - less than the number of row versions. </para> - </section> - </section> - <section - xml:id="supported.datatypes"> - <title> Supported Datatypes </title> - <para>HBase supports a "bytes-in/bytes-out" interface via <link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html">Put</link> - and <link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html">Result</link>, - so anything that can be converted to an array of bytes can be stored as a value. Input could - be strings, numbers, complex objects, or even images as long as they can rendered as bytes. </para> - <para>There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase - would probably be too much to ask); search the mailling list for conversations on this topic. 
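As a small illustration of the bytes-in/bytes-out interface (the table, family, and qualifier names here are placeholders):
<programlisting language="java">
// Anything that can be rendered as a byte[] can be stored.
HTable htable = new HTable(HBaseConfiguration.create(), "myTable");
Put put = new Put(Bytes.toBytes("rowkey-1"));
put.add(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes(42L));  // a long
put.add(Bytes.toBytes("d"), Bytes.toBytes("name"), Bytes.toBytes("foo")); // a String
htable.put(put);

Result result = htable.get(new Get(Bytes.toBytes("rowkey-1")));
long count = Bytes.toLong(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count")));
</programlisting>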
- All rows in HBase conform to the <xref - linkend="datamodel" />, and that includes versioning. Take that into consideration when - making your design, as well as block size for the ColumnFamily. </para> - - <section - xml:id="counters"> - <title>Counters</title> - <para> One supported datatype that deserves special mention are "counters" (i.e., the ability - to do atomic increments of numbers). See <link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#increment%28org.apache.hadoop.hbase.client.Increment%29">Increment</link> - in HTable. </para> - <para>Synchronization on counters are done on the RegionServer, not in the client. </para> - </section> - </section> - <section - xml:id="schema.joins"> - <title>Joins</title> - <para>If you have multiple tables, don't forget to factor in the potential for <xref - linkend="joins" /> into the schema design. </para> - </section> - <section - xml:id="ttl"> - <title>Time To Live (TTL)</title> - <para> ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows - once the expiration time is reached. This applies to <emphasis>all</emphasis> versions of a - row - even the current one. The TTL time encoded in the HBase for the row is specified in UTC. - </para> - <para> Store files which contains only expired rows are deleted on minor compaction. Setting - <varname>hbase.store.delete.expired.storefile</varname> to <code>false</code> disables this - feature. Setting <link linkend="schema.minversions">minimum number of versions</link> to other - than 0 also disables this.</para> - <para> See <link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html" - >HColumnDescriptor</link> for more information. </para> - <para>Recent versions of HBase also support setting time to live on a per cell basis. See <link - xlink:href="https://issues.apache.org/jira/browse/HBASE-10560">HBASE-10560</link> for more - information. Cell TTLs are submitted as an attribute on mutation requests (Appends, - Increments, Puts, etc.) using Mutation#setTTL. If the TTL attribute is set, it will be applied - to all cells updated on the server by the operation. There are two notable differences - between cell TTL handling and ColumnFamily TTLs:</para> - <itemizedlist> - <listitem> - <para>Cell TTLs are expressed in units of milliseconds instead of seconds.</para> - </listitem> - <listitem> - <para>A cell TTLs cannot extend the effective lifetime of a cell beyond a ColumnFamily level - TTL setting.</para> - </listitem> - </itemizedlist> - </section> - <section - xml:id="cf.keep.deleted"> - <title> Keeping Deleted Cells </title> - <para>By default, delete markers extend back to the beginning of time. Therefore, <link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link> - or <link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link> - operations will not see a deleted cell (row or column), even when the Get or Scan operation - indicates a time range - before the delete marker was placed.</para> - <para>ColumnFamilies can optionally keep deleted cells. In this case, deleted cells can still be - retrieved, as long as these operations specify a time range that ends before the timestamp of - any delete that would affect the cells. This allows for point-in-time queries even in the - presence of deletes. 
</para>
- <para> Deleted cells are still subject to TTL and there will never be more than "maximum number
- of versions" deleted cells. A new "raw" scan option returns all deleted rows and the delete
- markers. </para>
- <example>
- <title>Change the Value of <code>KEEP_DELETED_CELLS</code> Using HBase Shell</title>
- <screen>hbase> alter 't1', NAME => 'f1', KEEP_DELETED_CELLS => true</screen>
- </example>
- <example>
- <title>Change the Value of <code>KEEP_DELETED_CELLS</code> Using the API</title>
- <programlisting language="java">...
-HColumnDescriptor.setKeepDeletedCells(true);
-...
- </programlisting>
- </example>
- <para>See the API documentation for <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html#KEEP_DELETED_CELLS"
- >KEEP_DELETED_CELLS</link> for more information. </para>
- </section>
- <section
- xml:id="secondary.indexes">
- <title> Secondary Indexes and Alternate Query Paths </title>
- <para>This section could also be titled "what if my table rowkey looks like
- <emphasis>this</emphasis> but I also want to query my table like <emphasis>that</emphasis>."
- A common example on the dist-list is where a row-key is of the format "user-timestamp" but
- there are reporting requirements on activity across users for certain time ranges. Thus,
- selecting by user is easy because it is in the lead position of the key, but time is not. </para>
- <para>There is no single answer on the best way to handle this because it depends on... </para>
- <itemizedlist>
- <listitem>
- <para>Number of users</para>
- </listitem>
- <listitem>
- <para>Data size and data arrival rate</para>
- </listitem>
- <listitem>
- <para>Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs.
- pre-configured ranges) </para>
- </listitem>
- <listitem>
- <para>Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an
- ad-hoc report, whereas it may be too long for others) </para>
- </listitem>
- </itemizedlist>
- <para>... and solutions are also influenced by the size of the cluster and how much processing
- power you have to throw at the solution. Common techniques are in the sub-sections below. This is
- a representative, but not exhaustive, list of approaches. </para>
- <para>It should not be a surprise that secondary indexes require additional cluster space and
- processing. This is precisely what happens in an RDBMS because the act of creating an
- alternate index requires both space and processing cycles to update. RDBMS products are more
- advanced in this regard and handle alternative index management out of the box. However, HBase
- scales better at larger data volumes, so this is a feature trade-off. </para>
- <para>Pay attention to <xref
- linkend="performance" /> when implementing any of these approaches.</para>
- <para>Additionally, see the David Butler response in this dist-list thread <link
- xlink:href="http://search-hadoop.com/m/nvbiBp2TDP/Stargate%252Bhbase&subj=Stargate+hbase">HBase,
- mail # user - Stargate+hbase</link>.
- </para>
- <section
- xml:id="secondary.indexes.filter">
- <title> Filter Query </title>
- <para>Depending on the case, it may be appropriate to use <xref
- linkend="client.filter" />. In this case, no secondary index is created. However, don't
- try a full-scan on a large table like this from an application (i.e., single-threaded
- client).
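A sketch of such a filtered scan (the family, qualifier, and value are placeholders, and <code>table</code> is an existing HTable):
<programlisting language="java">
// The filter is evaluated server-side, but every region is still read,
// so this remains a full table scan with no secondary index.
Scan scan = new Scan();
scan.setFilter(new SingleColumnValueFilter(
    Bytes.toBytes("d"), Bytes.toBytes("user"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("user1234")));
ResultScanner scanner = table.getScanner(scan);
try {
  for (Result r : scanner) {
    // process matching rows
  }
} finally {
  scanner.close();
}
</programlisting>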
</para> - </section> - <section - xml:id="secondary.indexes.periodic"> - <title> Periodic-Update Secondary Index </title> - <para>A secondary index could be created in an other table which is periodically updated via a - MapReduce job. The job could be executed intra-day, but depending on load-strategy it could - still potentially be out of sync with the main data table.</para> - <para>See <xref - linkend="mapreduce.example.readwrite" /> for more information.</para> - </section> - <section - xml:id="secondary.indexes.dualwrite"> - <title> Dual-Write Secondary Index </title> - <para>Another strategy is to build the secondary index while publishing data to the cluster - (e.g., write to data table, write to index table). If this is approach is taken after a data - table already exists, then bootstrapping will be needed for the secondary index with a - MapReduce job (see <xref - linkend="secondary.indexes.periodic" />).</para> - </section> - <section - xml:id="secondary.indexes.summary"> - <title> Summary Tables </title> - <para>Where time-ranges are very wide (e.g., year-long report) and where the data is - voluminous, summary tables are a common approach. These would be generated with MapReduce - jobs into another table.</para> - <para>See <xref - linkend="mapreduce.example.summary" /> for more information.</para> - </section> - <section - xml:id="secondary.indexes.coproc"> - <title> Coprocessor Secondary Index </title> - <para>Coprocessors act like RDBMS triggers. These were added in 0.92. For more information, - see <xref - linkend="coprocessors" /> - </para> - </section> - </section> - <section - xml:id="constraints"> - <title>Constraints</title> - <para>HBase currently supports 'constraints' in traditional (SQL) database parlance. The advised - usage for Constraints is in enforcing business rules for attributes in the table (eg. make - sure values are in the range 1-10). Constraints could also be used to enforce referential - integrity, but this is strongly discouraged as it will dramatically decrease the write - throughput of the tables where integrity checking is enabled. Extensive documentation on using - Constraints can be found at: <link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/constraint">Constraint</link> - since version 0.94. </para> - </section> - <section - xml:id="schema.casestudies"> - <title>Schema Design Case Studies</title> - <para>The following will describe some typical data ingestion use-cases with HBase, and how the - rowkey design and construction can be approached. Note: this is just an illustration of - potential approaches, not an exhaustive list. Know your data, and know your processing - requirements. </para> - <para>It is highly recommended that you read the rest of the <xref - linkend="schema" /> first, before reading these case studies. </para> - <para>The following case studies are described: </para> - <itemizedlist> - <listitem> - <para>Log Data / Timeseries Data</para> - </listitem> - <listitem> - <para>Log Data / Timeseries on Steroids</para> - </listitem> - <listitem> - <para>Customer/Order</para> - </listitem> - <listitem> - <para>Tall/Wide/Middle Schema Design</para> - </listitem> - <listitem> - <para>List Data</para> - </listitem> - </itemizedlist> - <section - xml:id="schema.casestudies.log-timeseries"> - <title>Case Study - Log Data and Timeseries Data</title> - <para>Assume that the following data elements are being collected. 
</para>
- <itemizedlist>
- <listitem>
- <para>Hostname</para>
- </listitem>
- <listitem>
- <para>Timestamp</para>
- </listitem>
- <listitem>
- <para>Log event</para>
- </listitem>
- <listitem>
- <para>Value/message</para>
- </listitem>
- </itemizedlist>
- <para>We can store them in an HBase table called LOG_DATA, but what will the rowkey be? From
- these attributes the rowkey will be some combination of hostname, timestamp, and log-event -
- but what specifically? </para>
- <section
- xml:id="schema.casestudies.log-timeseries.tslead">
- <title>Timestamp In The Rowkey Lead Position</title>
- <para>The rowkey <code>[timestamp][hostname][log-event]</code> suffers from the
- monotonically increasing rowkey problem described in <xref
- linkend="timeseries" />. </para>
- <para>There is another pattern frequently mentioned in the dist-lists about "bucketing"
- timestamps, by performing a mod operation on the timestamp. If time-oriented scans are
- important, this could be a useful approach. Attention must be paid to the number of
- buckets, because this will require the same number of scans to return results.</para>
- <programlisting language="java">
-long bucket = timestamp % numBuckets;
- </programlisting>
- <para>... to construct:</para>
- <programlisting>
-[bucket][timestamp][hostname][log-event]
- </programlisting>
- <para>As stated above, to select data for a particular timerange, a Scan will need to be
- performed for each bucket. 100 buckets, for example, will provide a wide distribution in
- the keyspace but it will require 100 Scans to obtain data for a single timestamp, so there
- are trade-offs. </para>
- </section>
- <!-- ts lead -->
- <section
- xml:id="schema.casestudies.log-timeseries.hostlead">
- <title>Host In The Rowkey Lead Position</title>
- <para>The rowkey <code>[hostname][log-event][timestamp]</code> is a candidate if there is a
- large-ish number of hosts to spread the writes and reads across the keyspace. This
- approach would be useful if scanning by hostname was a priority. </para>
- </section>
- <!-- host lead -->
- <section
- xml:id="schema.casestudies.log-timeseries.revts">
- <title>Timestamp, or Reverse Timestamp?</title>
- <para>If the most important access path is to pull the most recent events, then storing the
- timestamps as reverse-timestamps (e.g., <code>timestamp = Long.MAX_VALUE - timestamp</code>)
- will create the property of being able to do a Scan on
- <code>[hostname][log-event]</code> to quickly obtain the most recently
- captured events. </para>
- <para>Neither approach is wrong; it just depends on what is most appropriate for the
- situation. </para>
- <note>
- <title>Reverse Scan API</title>
- <para>
- <link
- xlink:href="https://issues.apache.org/jira/browse/HBASE-4811">HBASE-4811</link>
- implements an API to scan a table or a range within a table in reverse, reducing the
- need to optimize your schema for forward or reverse scanning. This feature is available
- in HBase 0.98 and later. See <link
- xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setReversed%28boolean" />
- for more information. </para>
- </note>
- </section>
- <!-- revts -->
- <section
- xml:id="schema.casestudies.log-timeseries.varkeys">
- <title>Variable Length or Fixed Length Rowkeys?</title>
- <para>It is critical to remember that rowkeys are stamped on every column in HBase. If the
- hostname is "a" and the event type is "e1" then the resulting rowkey would be quite small.
- However, what if the ingested hostname is âmyserver1.mycompany.comâ and the event type is - âcom.package1.subpackage2.subsubpackage3.ImportantServiceâ? </para> - <para>It might make sense to use some substitution in the rowkey. There are at least two - approaches: hashed and numeric. In the Hostname In The Rowkey Lead Position example, it - might look like this: </para> - <para>Composite Rowkey With Hashes:</para> - <itemizedlist> - <listitem> - <para>[MD5 hash of hostname] = 16 bytes</para> - </listitem> - <listitem> - <para>[MD5 hash of event-type] = 16 bytes</para> - </listitem> - <listitem> - <para>[timestamp] = 8 bytes</para> - </listitem> - </itemizedlist> - <para>Composite Rowkey With Numeric Substitution: </para> - <para>For this approach another lookup table would be needed in addition to LOG_DATA, called - LOG_TYPES. The rowkey of LOG_TYPES would be: </para> - <itemizedlist> - <listitem> - <para>[type] (e.g., byte indicating hostname vs. event-type)</para> - </listitem> - <listitem> - <para>[bytes] variable length bytes for raw hostname or event-type.</para> - </listitem> - </itemizedlist> - <para>A column for this rowkey could be a long with an assigned number, which could be - obtained by using an <link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long%29">HBase - counter</link>. </para> - <para>So the resulting composite rowkey would be: </para> - <itemizedlist> - <listitem> - <para>[substituted long for hostname] = 8 bytes</para> - </listitem> - <listitem> - <para>[substituted long for event type] = 8 bytes</para> - </listitem> - <listitem> - <para>[timestamp] = 8 bytes</para> - </listitem> - </itemizedlist> - <para>In either the Hash or Numeric substitution approach, the raw values for hostname and - event-type can be stored as columns. </para> - </section> - <!-- varkeys --> - </section> - <!-- log data and timeseries --> - <section - xml:id="schema.casestudies.log-steroids"> - <title>Case Study - Log Data and Timeseries Data on Steroids</title> - <para>This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack - rows into columns for certain time-periods. For a detailed explanation, see: <link - xlink:href="http://opentsdb.net/schema.html">http://opentsdb.net/schema.html</link>, and <link - xlink:href="http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-lessons-learned-from-opentsdb.html">Lessons - Learned from OpenTSDB</link> from HBaseCon2012. </para> - <para>But this is how the general concept works: data is ingested, for example, in this - mannerâ¦</para> - <screen> -[hostname][log-event][timestamp1] -[hostname][log-event][timestamp2] -[hostname][log-event][timestamp3] - </screen> - <para>⦠with separate rowkeys for each detailed event, but is re-written like this⦠</para> - <screen>[hostname][log-event][timerange]</screen> - <para>⦠and each of the above events are converted into columns stored with a time-offset - relative to the beginning timerange (e.g., every 5 minutes). This is obviously a very - advanced processing technique, but HBase makes this possible. </para> - </section> - <!-- log data timeseries steroids --> - - <section - xml:id="schema.casestudies.custorder"> - <title>Case Study - Customer/Order</title> - <para>Assume that HBase is used to store customer and order information. There are two core - record-types being ingested: a Customer record type, and Order record type. 
- </section> <!-- varkeys -->
- </section> <!-- log data and timeseries -->
- <section xml:id="schema.casestudies.log-steroids">
- <title>Case Study - Log Data and Timeseries Data on Steroids</title>
- <para>This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time-periods. For a detailed explanation, see: <link xlink:href="http://opentsdb.net/schema.html">http://opentsdb.net/schema.html</link>, and <link xlink:href="http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-lessons-learned-from-opentsdb.html">Lessons Learned from OpenTSDB</link> from HBaseCon2012. </para>
- <para>But this is how the general concept works: data is ingested, for example, in this manner...</para>
- <screen>
-[hostname][log-event][timestamp1]
-[hostname][log-event][timestamp2]
-[hostname][log-event][timestamp3]
- </screen>
- <para>... with separate rowkeys for each detailed event, but is re-written like this... </para>
- <screen>[hostname][log-event][timerange]</screen>
- <para>... and each of the above events is converted into a column stored with a time-offset relative to the beginning timerange (e.g., every 5 minutes). This is obviously a very advanced processing technique, but HBase makes this possible. </para>
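- <para>The following is a simplified, illustrative sketch of such a re-write; it is not OpenTSDB's actual implementation, and the table name, column family, and 5-minute range size are assumptions:</para>
- <programlisting language="java"><![CDATA[
-import java.io.IOException;
-import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.hbase.client.HTable;
-import org.apache.hadoop.hbase.client.Put;
-import org.apache.hadoop.hbase.util.Bytes;
-
-public class TimerangeRollup {
-  static final long RANGE_MS = 5 * 60 * 1000L;       // e.g., one row per 5-minute range
-  static final byte[] FAMILY = Bytes.toBytes("d");   // assumed column family
-
-  public static void writeEvent(Configuration conf, byte[] hostname, byte[] logEvent,
-      long timestamp, byte[] message) throws IOException {
-    long rangeStart = timestamp - (timestamp % RANGE_MS);
-    int offset = (int) (timestamp - rangeStart);
-    // Rowkey: [hostname][log-event][timerange]; column qualifier: offset within the range.
-    byte[] rowkey = Bytes.add(hostname, logEvent, Bytes.toBytes(rangeStart));
-    Put put = new Put(rowkey);
-    put.add(FAMILY, Bytes.toBytes(offset), message);
-    HTable table = new HTable(conf, "LOG_DATA_ROLLUP");   // assumed table name
-    try {
-      table.put(put);
-    } finally {
-      table.close();
-    }
-  }
-}
-]]></programlisting>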
- </section> <!-- log data timeseries steroids -->
-
- <section xml:id="schema.casestudies.custorder">
- <title>Case Study - Customer/Order</title>
- <para>Assume that HBase is used to store customer and order information. There are two core record-types being ingested: a Customer record type, and an Order record type. </para>
- <para>The Customer record type would include all the things that you'd typically expect: </para>
- <itemizedlist>
- <listitem><para>Customer number</para></listitem>
- <listitem><para>Customer name</para></listitem>
- <listitem><para>Address (e.g., city, state, zip)</para></listitem>
- <listitem><para>Phone numbers, etc.</para></listitem>
- </itemizedlist>
- <para>The Order record type would include things like: </para>
- <itemizedlist>
- <listitem><para>Customer number</para></listitem>
- <listitem><para>Order number</para></listitem>
- <listitem><para>Sales date</para></listitem>
- <listitem><para>A series of nested objects for shipping locations and line-items (see <xref linkend="schema.casestudies.custorder.obj" /> for details)</para></listitem>
- </itemizedlist>
- <para>Assuming that the combination of customer number and sales order uniquely identifies an order, these two attributes will compose the rowkey, and specifically a composite key such as: </para>
- <screen>[customer number][order number]</screen>
- <para>... for an ORDER table. However, there are more design decisions to make: are the <emphasis>raw</emphasis> values the best choices for rowkeys? </para>
- <para>The same design questions in the Log Data use-case confront us here. What is the keyspace of the customer number, and what is the format (e.g., numeric? alphanumeric?). As it is advantageous to use fixed-length keys in HBase, as well as keys that can support a reasonable spread in the keyspace, similar options appear: </para>
- <para>Composite Rowkey With Hashes: </para>
- <itemizedlist>
- <listitem><para>[MD5 of customer number] = 16 bytes</para></listitem>
- <listitem><para>[MD5 of order number] = 16 bytes</para></listitem>
- </itemizedlist>
- <para>Composite Numeric/Hash Combo Rowkey: </para>
- <itemizedlist>
- <listitem><para>[substituted long for customer number] = 8 bytes</para></listitem>
- <listitem><para>[MD5 of order number] = 16 bytes</para></listitem>
- </itemizedlist>
- <section xml:id="schema.casestudies.custorder.tables">
- <title>Single Table? Multiple Tables?</title>
- <para>A traditional design approach would have separate tables for CUSTOMER and SALES. Another option is to pack multiple record types into a single table (e.g., CUSTOMER++). </para>
- <para>Customer Record Type Rowkey: </para>
- <itemizedlist>
- <listitem><para>[customer-id]</para></listitem>
- <listitem><para>[type] = type indicating "1" for customer record type</para></listitem>
- </itemizedlist>
- <para>Order Record Type Rowkey: </para>
- <itemizedlist>
- <listitem><para>[customer-id]</para></listitem>
- <listitem><para>[type] = type indicating "2" for order record type</para></listitem>
- <listitem><para>[order]</para></listitem>
- </itemizedlist>
- <para>The advantage of this particular CUSTOMER++ approach is that it organizes many different record-types by customer-id (e.g., a single scan could get you everything about that customer). The disadvantage is that it's not as easy to scan for a particular record-type. </para>
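- <para>As an illustrative sketch of that single-scan advantage (the table name and the fact that the customer-id is a rowkey prefix are assumptions for the example), a prefix Scan on the customer-id returns the customer record and all of that customer's order records in one pass:</para>
- <programlisting language="java"><![CDATA[
-import java.io.IOException;
-import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.hbase.client.HTable;
-import org.apache.hadoop.hbase.client.Result;
-import org.apache.hadoop.hbase.client.ResultScanner;
-import org.apache.hadoop.hbase.client.Scan;
-import org.apache.hadoop.hbase.filter.PrefixFilter;
-
-public class CustomerPlusPlusScan {
-  public static void scanCustomer(Configuration conf, byte[] customerId) throws IOException {
-    Scan scan = new Scan(customerId);               // start at the customer-id prefix
-    scan.setFilter(new PrefixFilter(customerId));   // stop once the prefix no longer matches
-    HTable table = new HTable(conf, "CUSTOMER");    // assumed table name
-    try {
-      ResultScanner scanner = table.getScanner(scan);
-      try {
-        for (Result r : scanner) {
-          // r covers the customer record (type "1") and each order record (type "2")
-        }
-      } finally {
-        scanner.close();
-      }
-    } finally {
-      table.close();
-    }
-  }
-}
-]]></programlisting>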
- </section>
- <section xml:id="schema.casestudies.custorder.obj">
- <title>Order Object Design</title>
- <para>Now we need to address how to model the Order object. Assume that the class structure is as follows:</para>
- <variablelist>
- <varlistentry>
- <term>Order</term>
- <listitem><para>(an Order can have multiple ShippingLocations)</para></listitem>
- </varlistentry>
- <varlistentry>
- <term>LineItem</term>
- <listitem><para>(a ShippingLocation can have multiple LineItems)</para></listitem>
- </varlistentry>
- </variablelist>
- <para>... there are multiple options for storing this data. </para>
- <section xml:id="schema.casestudies.custorder.obj.norm">
- <title>Completely Normalized</title>
- <para>With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and LINE_ITEM. </para>
- <para>The ORDER table's rowkey was described above: <xref linkend="schema.casestudies.custorder" /> </para>
- <para>The SHIPPING_LOCATION's composite rowkey would be something like this: </para>
- <itemizedlist>
- <listitem><para>[order-rowkey]</para></listitem>
- <listitem><para>[shipping location number] (e.g., 1st location, 2nd, etc.)</para></listitem>
- </itemizedlist>
- <para>The LINE_ITEM table's composite rowkey would be something like this: </para>
- <itemizedlist>
- <listitem><para>[order-rowkey]</para></listitem>
- <listitem><para>[shipping location number] (e.g., 1st location, 2nd, etc.)</para></listitem>
- <listitem><para>[line item number] (e.g., 1st lineitem, 2nd, etc.)</para></listitem>
- </itemizedlist>
- <para>Such a normalized model is likely to be the approach with an RDBMS, but that's not your only option with HBase. The cons of such an approach are that to retrieve information about any Order, you will need: </para>
- <itemizedlist>
- <listitem><para>Get on the ORDER table for the Order</para></listitem>
- <listitem><para>Scan on the SHIPPING_LOCATION table for that order to get the ShippingLocation instances</para></listitem>
- <listitem><para>Scan on the LINE_ITEM for each ShippingLocation</para></listitem>
- </itemizedlist>
- <para>... granted, this is what an RDBMS would do under the covers anyway, but since there are no joins in HBase you're just more aware of this fact. </para>
- </section>
- <section xml:id="schema.casestudies.custorder.obj.rectype">
- <title>Single Table With Record Types</title>
- <para>With this approach, there would exist a single table ORDER that would contain all of the record types. </para>
- <para>The Order rowkey was described above: <xref linkend="schema.casestudies.custorder" /></para>
- <itemizedlist>
- <listitem><para>[order-rowkey]</para></listitem>
- <listitem><para>[ORDER record type]</para></listitem>
- </itemizedlist>
- <para>The ShippingLocation composite rowkey would be something like this: </para>
- <itemizedlist>
- <listitem><para>[order-rowkey]</para></listitem>
- <listitem><para>[SHIPPING record type]</para></listitem>
- <listitem><para>[shipping location number] (e.g., 1st location, 2nd, etc.)</para></listitem>
- </itemizedlist>
- <para>The LineItem composite rowkey would be something like this: </para>
- <itemizedlist>
- <listitem><para>[order-rowkey]</para></listitem>
- <listitem><para>[LINE record type]</para></listitem>
- <listitem><para>[shipping location number] (e.g., 1st location, 2nd, etc.)</para></listitem>
- <listitem><para>[line item number] (e.g., 1st lineitem, 2nd, etc.)</para></listitem>
- </itemizedlist>
- </section>
- <section xml:id="schema.casestudies.custorder.obj.denorm">
- <title>Denormalized</title>
- <para>A variant of the Single Table With Record Types approach is to denormalize and flatten some of the object hierarchy, such as collapsing the ShippingLocation attributes onto each LineItem instance. </para>
- <para>The LineItem composite rowkey would be something like this: </para>
- <itemizedlist>
- <listitem><para>[order-rowkey]</para></listitem>
- <listitem><para>[LINE record type]</para></listitem>
- <listitem><para>[line item number] (e.g., 1st lineitem, 2nd, etc. - care must be taken that they are unique across the entire order)</para></listitem>
- </itemizedlist>
- <para>... and the LineItem columns would be something like this: </para>
- <itemizedlist>
- <listitem><para>itemNumber</para></listitem>
- <listitem><para>quantity</para></listitem>
- <listitem><para>price</para></listitem>
- <listitem><para>shipToLine1 (denormalized from ShippingLocation)</para></listitem>
- <listitem><para>shipToLine2 (denormalized from ShippingLocation)</para></listitem>
- <listitem><para>shipToCity (denormalized from ShippingLocation)</para></listitem>
- <listitem><para>shipToState (denormalized from ShippingLocation)</para></listitem>
- <listitem><para>shipToZip (denormalized from ShippingLocation)</para></listitem>
- </itemizedlist>
- <para>The pros of this approach include a less complex object hierarchy, but one of the cons is that updating gets more complicated in case any of this information changes. </para>
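- <para>As an illustrative sketch (the table name, column family, type marker, and value encodings are assumptions, not part of the original design), writing one denormalized LineItem row with the ShippingLocation attributes collapsed onto it might look like this:</para>
- <programlisting language="java"><![CDATA[
-import java.io.IOException;
-import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.hbase.client.HTable;
-import org.apache.hadoop.hbase.client.Put;
-import org.apache.hadoop.hbase.util.Bytes;
-
-public class DenormalizedLineItemWrite {
-  static final byte[] CF = Bytes.toBytes("d");        // assumed column family
-  static final byte[] LINE_TYPE = Bytes.toBytes("3"); // assumed LINE record type marker
-
-  public static void writeLineItem(Configuration conf, byte[] orderRowkey, int lineItemNumber,
-      String itemNumber, long quantity, long price, String shipToLine1, String shipToCity,
-      String shipToState, String shipToZip) throws IOException {
-    // Rowkey: [order-rowkey][LINE record type][line item number]
-    byte[] rowkey = Bytes.add(orderRowkey, LINE_TYPE, Bytes.toBytes(lineItemNumber));
-    Put put = new Put(rowkey);
-    put.add(CF, Bytes.toBytes("itemNumber"), Bytes.toBytes(itemNumber));
-    put.add(CF, Bytes.toBytes("quantity"), Bytes.toBytes(quantity));
-    put.add(CF, Bytes.toBytes("price"), Bytes.toBytes(price));
-    // ShippingLocation attributes denormalized onto the LineItem row:
-    put.add(CF, Bytes.toBytes("shipToLine1"), Bytes.toBytes(shipToLine1));
-    put.add(CF, Bytes.toBytes("shipToCity"), Bytes.toBytes(shipToCity));
-    put.add(CF, Bytes.toBytes("shipToState"), Bytes.toBytes(shipToState));
-    put.add(CF, Bytes.toBytes("shipToZip"), Bytes.toBytes(shipToZip));
-    HTable table = new HTable(conf, "ORDER");         // single-table design from above
-    try {
-      table.put(put);
-    } finally {
-      table.close();
-    }
-  }
-}
-]]></programlisting>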
- </section>
- <section xml:id="schema.casestudies.custorder.obj.singleobj">
- <title>Object BLOB</title>
- <para>With this approach, the entire Order object graph is treated, in one way or another, as a BLOB. For example, the ORDER table's rowkey was described above: <xref linkend="schema.casestudies.custorder" />, and a single column called "order" would contain an object that could be deserialized into a containing Order, its ShippingLocations, and its LineItems. </para>
- <para>There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables, etc. All of them are variants of the same approach: encode the object graph to a byte-array. Care should be taken with this approach to ensure backward compatibility in case the object model changes, such that older persisted structures can still be read back out of HBase. </para>
- <para>Pros are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization, language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in getting frameworks like Hive to work with custom objects like this. </para>
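- <para>A minimal sketch of the BLOB approach using plain Java Serialization, one of the options listed above (the Order class is assumed to be Serializable, and the column family and column name are assumptions; note the backward-compatibility and language caveats just discussed):</para>
- <programlisting language="java"><![CDATA[
-import java.io.ByteArrayInputStream;
-import java.io.ByteArrayOutputStream;
-import java.io.IOException;
-import java.io.ObjectInputStream;
-import java.io.ObjectOutputStream;
-import java.io.Serializable;
-import org.apache.hadoop.hbase.client.Get;
-import org.apache.hadoop.hbase.client.HTable;
-import org.apache.hadoop.hbase.client.Put;
-import org.apache.hadoop.hbase.client.Result;
-import org.apache.hadoop.hbase.util.Bytes;
-
-public class OrderBlobStore {
-  static final byte[] CF = Bytes.toBytes("d");        // assumed column family
-  static final byte[] COL = Bytes.toBytes("order");   // the single "order" column
-
-  // Write: serialize the whole object graph and store it in one cell.
-  public static void putOrder(HTable table, byte[] orderRowkey, Serializable order)
-      throws IOException {
-    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
-    ObjectOutputStream out = new ObjectOutputStream(bytes);
-    out.writeObject(order);
-    out.close();
-    Put put = new Put(orderRowkey);
-    put.add(CF, COL, bytes.toByteArray());
-    table.put(put);
-  }
-
-  // Read: a single Get retrieves the entire Order graph, which must then be deserialized whole.
-  public static Object getOrder(HTable table, byte[] orderRowkey)
-      throws IOException, ClassNotFoundException {
-    Result result = table.get(new Get(orderRowkey));
-    byte[] blob = result.getValue(CF, COL);
-    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob));
-    return in.readObject();
-  }
-}
-]]></programlisting>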
- </section>
- </section> <!-- cust/order order object -->
- </section> <!-- cust/order -->
-
- <section xml:id="schema.smackdown">
- <title>Case Study - "Tall/Wide/Middle" Schema Design Smackdown</title>
- <para>This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables. These are general guidelines and not laws - each application must consider its own needs. </para>
- <section xml:id="schema.smackdown.rowsversions">
- <title>Rows vs. Versions</title>
- <para>A common question is whether one should prefer rows or HBase's built-in versioning. The context is typically where there are "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 1 max versions). The rows-approach would require storing a timestamp in some portion of the rowkey so that they would not overwrite with each successive update. </para>
- <para>Preference: Rows (generally speaking). </para>
- </section>
- <section xml:id="schema.smackdown.rowscols">
- <title>Rows vs. Columns</title>
- <para>Another common question is whether one should prefer rows or columns. The context is typically in extreme cases of wide tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 column apiece. </para>
- <para>Preference: Rows (generally speaking). To be clear, this guideline is in the context of extremely wide cases, not in the standard use-case where one needs to store a few dozen or hundred columns. But there is also a middle path between these two options, and that is "Rows as Columns." </para>
- </section>
- <section xml:id="schema.smackdown.rowsascols">
- <title>Rows as Columns</title>
- <para>The middle path between Rows vs. Columns is packing data that would be a separate row into columns, for certain rows. OpenTSDB is the best example of this case, where a single row represents a defined time-range, and then discrete events are treated as columns. This approach is often more complex, and may require the additional complexity of re-writing your data, but has the advantage of being I/O efficient. For an overview of this approach, see <xref linkend="schema.casestudies.log-steroids" />. </para>
- </section>
- </section>
- <!-- note: the following id is not consistent with the others because it was formerly in the Case Studies chapter, but I didn't want to break backward compatibility of the link. But future entries should look like the above case-study links (schema.casestudies. ...) -->
- <section xml:id="casestudies.schema.listdata">
- <title>Case Study - List Data</title>
- <para>The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in Apache HBase. </para>
- <para>*** QUESTION ***</para>
- <para> We're looking at how to store a large amount of (per-user) list data in HBase, and we were trying to figure out what kind of access pattern made the most sense. One option is to store the majority of the data in a key, so we could have something like: </para>
-
- <programlisting><![CDATA[
-<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
-<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
-<FixedWidthUserName><FixedWidthValueId3>:"" (no value)
-]]></programlisting>
-
- <para>The other option we had was to do this entirely using:</para>
- <programlisting language="xml"><![CDATA[
-<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
-<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
- ]]></programlisting>
- <para> where each row would contain multiple values. So in one case reading the first thirty values would be: </para>
- <programlisting language="java"><![CDATA[
-scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}
- ]]></programlisting>
- <para>And in the second case it would be </para>
- <programlisting>
-get 'FixedWidthUserName\x00\x00\x00\x00'
- </programlisting>
- <para> The general usage pattern would be to read only the first 30 values of these lists, with infrequent access reading deeper into the lists. Some users would have <= 30 total values in these lists, and some users would have millions (i.e., a power-law distribution). </para>
- <para> The single-value format seems like it would take up more space on HBase, but would offer some improved retrieval / pagination flexibility. Would there be any significant performance advantages to being able to paginate via gets vs. paginating with scans? </para>
- <para> My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we'll always need the same page size. I've ended up hearing different people tell me opposite things about performance. I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case. I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we'd need to update all subsequent rows). </para>
- <para> Thanks for help / suggestions / follow-up questions. </para>
- <para>*** ANSWER ***</para>
- <para> If I understand you correctly, you're ultimately trying to store triples in the form "user, valueid, value", right? E.g., something like: </para>
- <programlisting>
-"user123, firstname, Paul",
-"user234, lastname, Smith"
- </programlisting>
- <para> (But the usernames are fixed width, and the valueids are fixed width). </para>
- <para> And, your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And these values should be returned sorted by valueid? </para>
- <para> The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you're really sure it is needed. </para>
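- <para>(Illustrative aside, not part of the original exchange: a sketch of reading the first 30 values for one user under the recommended one-row-per-user+value schema. The column family/qualifier names and the fixed-width key handling are assumptions.)</para>
- <programlisting language="java"><![CDATA[
-import java.io.IOException;
-import org.apache.hadoop.hbase.client.HTable;
-import org.apache.hadoop.hbase.client.Result;
-import org.apache.hadoop.hbase.client.ResultScanner;
-import org.apache.hadoop.hbase.client.Scan;
-import org.apache.hadoop.hbase.filter.PrefixFilter;
-import org.apache.hadoop.hbase.util.Bytes;
-
-public class UserListScan {
-  static final int PAGE_SIZE = 30;
-
-  public static void readFirstPage(HTable table, byte[] fixedWidthUser) throws IOException {
-    Scan scan = new Scan(fixedWidthUser);              // rows sort by [user][valueid]
-    scan.setFilter(new PrefixFilter(fixedWidthUser));  // stay within this user's rows
-    scan.setCaching(PAGE_SIZE);                        // fetch the page in one round trip if possible
-    ResultScanner scanner = table.getScanner(scan);
-    try {
-      int count = 0;
-      for (Result r : scanner) {
-        byte[] value = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("v")); // assumed names
-        // process value...
-        if (++count >= PAGE_SIZE) {
-          break;                                       // only the first 30 values are needed
-        }
-      }
-    } finally {
-      scanner.close();
-    }
-  }
-}
-]]></programlisting>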
- <para> Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done. What you're giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn't sound like you need that. Doing it this way is generally recommended (see here <link xlink:href="http://hbase.apache.org/book.html#schema.smackdown">http://hbase.apache.org/book.html#schema.smackdown</link>). </para>
- <para> Your second option is "wide": you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row. I'm guessing you jumped to the "paginated" version because you're assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you're not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn't be fundamentally worse. The client has methods that allow you to get specific slices of columns. </para>
- <para> Note that neither case fundamentally uses more disk space than the other; you're just "shifting" part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifiers in option 2). Under the covers, every key/value still stores the whole row key, and column family name. (If this is a bit confusing, take an hour and watch Lars George's excellent video about understanding HBase schema design: <link xlink:href="http://www.youtube.com/watch?v=_HLoH_PgrLk">http://www.youtube.com/watch?v=_HLoH_PgrLk</link>.) </para>
- <para> A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out. If you don't have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value). Start simple and iterate! :) </para>
- </section> <!-- listdata -->
-
- </section> <!-- schema design cases -->
- <section xml:id="schema.ops">
- <title>Operational and Performance Configuration Options</title>
- <para>See the Performance section <xref linkend="perf.schema" /> for more information on operational and performance schema design options, such as Bloom Filters, Table-configured regionsizes, compression, and blocksizes. </para>
- </section>
-
-</chapter>
-<!-- schema design -->