http://git-wip-us.apache.org/repos/asf/hbase-site/blob/07527d7e/book.html ---------------------------------------------------------------------- diff --git a/book.html b/book.html index 1a4468f..5595a1d 100644 --- a/book.html +++ b/book.html @@ -39,237 +39,239 @@ <li><a href="#example_config">8. Example Configurations</a></li> <li><a href="#important_configurations">9. The Important Configurations</a></li> <li><a href="#dyn_config">10. Dynamic Configuration</a></li> +<li><a href="#amazon_s3_configuration">11. Using Amazon S3 Storage</a></li> </ul> </li> <li><a href="#upgrading">Upgrading</a> <ul class="sectlevel1"> -<li><a href="#hbase.versioning">11. HBase version number and compatibility</a></li> -<li><a href="#_upgrade_paths">12. Upgrade Paths</a></li> +<li><a href="#hbase.versioning">12. HBase version number and compatibility</a></li> +<li><a href="#_upgrade_paths">13. Upgrade Paths</a></li> </ul> </li> <li><a href="#shell">The Apache HBase Shell</a> <ul class="sectlevel1"> -<li><a href="#scripting">13. Scripting with Ruby</a></li> -<li><a href="#_running_the_shell_in_non_interactive_mode">14. Running the Shell in Non-Interactive Mode</a></li> -<li><a href="#hbase.shell.noninteractive">15. HBase Shell in OS Scripts</a></li> -<li><a href="#_read_hbase_shell_commands_from_a_command_file">16. Read HBase Shell Commands from a Command File</a></li> -<li><a href="#_passing_vm_options_to_the_shell">17. Passing VM Options to the Shell</a></li> -<li><a href="#_shell_tricks">18. Shell Tricks</a></li> +<li><a href="#scripting">14. Scripting with Ruby</a></li> +<li><a href="#_running_the_shell_in_non_interactive_mode">15. Running the Shell in Non-Interactive Mode</a></li> +<li><a href="#hbase.shell.noninteractive">16. HBase Shell in OS Scripts</a></li> +<li><a href="#_read_hbase_shell_commands_from_a_command_file">17. Read HBase Shell Commands from a Command File</a></li> +<li><a href="#_passing_vm_options_to_the_shell">18. 
Passing VM Options to the Shell</a></li> +<li><a href="#_shell_tricks">19. Shell Tricks</a></li> </ul> </li> <li><a href="#datamodel">Data Model</a> <ul class="sectlevel1"> -<li><a href="#conceptual.view">19. Conceptual View</a></li> -<li><a href="#physical.view">20. Physical View</a></li> -<li><a href="#_namespace">21. Namespace</a></li> -<li><a href="#_table">22. Table</a></li> -<li><a href="#_row">23. Row</a></li> -<li><a href="#columnfamily">24. Column Family</a></li> -<li><a href="#_cells">25. Cells</a></li> -<li><a href="#_data_model_operations">26. Data Model Operations</a></li> -<li><a href="#versions">27. Versions</a></li> -<li><a href="#dm.sort">28. Sort Order</a></li> -<li><a href="#dm.column.metadata">29. Column Metadata</a></li> -<li><a href="#joins">30. Joins</a></li> -<li><a href="#_acid">31. ACID</a></li> +<li><a href="#conceptual.view">20. Conceptual View</a></li> +<li><a href="#physical.view">21. Physical View</a></li> +<li><a href="#_namespace">22. Namespace</a></li> +<li><a href="#_table">23. Table</a></li> +<li><a href="#_row">24. Row</a></li> +<li><a href="#columnfamily">25. Column Family</a></li> +<li><a href="#_cells">26. Cells</a></li> +<li><a href="#_data_model_operations">27. Data Model Operations</a></li> +<li><a href="#versions">28. Versions</a></li> +<li><a href="#dm.sort">29. Sort Order</a></li> +<li><a href="#dm.column.metadata">30. Column Metadata</a></li> +<li><a href="#joins">31. Joins</a></li> +<li><a href="#_acid">32. ACID</a></li> </ul> </li> <li><a href="#schema">HBase and Schema Design</a> <ul class="sectlevel1"> -<li><a href="#schema.creation">32. Schema Creation</a></li> -<li><a href="#table_schema_rules_of_thumb">33. Table Schema Rules Of Thumb</a></li> +<li><a href="#schema.creation">33. Schema Creation</a></li> +<li><a href="#table_schema_rules_of_thumb">34. 
Table Schema Rules Of Thumb</a></li> </ul> </li> <li><a href="#regionserver_sizing_rules_of_thumb">RegionServer Sizing Rules of Thumb</a> <ul class="sectlevel1"> -<li><a href="#number.of.cfs">34. On the number of column families</a></li> -<li><a href="#rowkey.design">35. Rowkey Design</a></li> -<li><a href="#schema.versions">36. Number of Versions</a></li> -<li><a href="#supported.datatypes">37. Supported Datatypes</a></li> -<li><a href="#schema.joins">38. Joins</a></li> -<li><a href="#ttl">39. Time To Live (TTL)</a></li> -<li><a href="#cf.keep.deleted">40. Keeping Deleted Cells</a></li> -<li><a href="#secondary.indexes">41. Secondary Indexes and Alternate Query Paths</a></li> -<li><a href="#_constraints">42. Constraints</a></li> -<li><a href="#schema.casestudies">43. Schema Design Case Studies</a></li> -<li><a href="#schema.ops">44. Operational and Performance Configuration Options</a></li> +<li><a href="#number.of.cfs">35. On the number of column families</a></li> +<li><a href="#rowkey.design">36. Rowkey Design</a></li> +<li><a href="#schema.versions">37. Number of Versions</a></li> +<li><a href="#supported.datatypes">38. Supported Datatypes</a></li> +<li><a href="#schema.joins">39. Joins</a></li> +<li><a href="#ttl">40. Time To Live (TTL)</a></li> +<li><a href="#cf.keep.deleted">41. Keeping Deleted Cells</a></li> +<li><a href="#secondary.indexes">42. Secondary Indexes and Alternate Query Paths</a></li> +<li><a href="#_constraints">43. Constraints</a></li> +<li><a href="#schema.casestudies">44. Schema Design Case Studies</a></li> +<li><a href="#schema.ops">45. Operational and Performance Configuration Options</a></li> </ul> </li> <li><a href="#mapreduce">HBase and MapReduce</a> <ul class="sectlevel1"> -<li><a href="#hbase.mapreduce.classpath">45. HBase, MapReduce, and the CLASSPATH</a></li> -<li><a href="#_mapreduce_scan_caching">46. MapReduce Scan Caching</a></li> -<li><a href="#_bundled_hbase_mapreduce_jobs">47. 
Bundled HBase MapReduce Jobs</a></li> -<li><a href="#_hbase_as_a_mapreduce_job_data_source_and_data_sink">48. HBase as a MapReduce Job Data Source and Data Sink</a></li> -<li><a href="#_writing_hfiles_directly_during_bulk_import">49. Writing HFiles Directly During Bulk Import</a></li> -<li><a href="#_rowcounter_example">50. RowCounter Example</a></li> -<li><a href="#splitter">51. Map-Task Splitting</a></li> -<li><a href="#mapreduce.example">52. HBase MapReduce Examples</a></li> -<li><a href="#mapreduce.htable.access">53. Accessing Other HBase Tables in a MapReduce Job</a></li> -<li><a href="#mapreduce.specex">54. Speculative Execution</a></li> -<li><a href="#cascading">55. Cascading</a></li> +<li><a href="#hbase.mapreduce.classpath">46. HBase, MapReduce, and the CLASSPATH</a></li> +<li><a href="#_mapreduce_scan_caching">47. MapReduce Scan Caching</a></li> +<li><a href="#_bundled_hbase_mapreduce_jobs">48. Bundled HBase MapReduce Jobs</a></li> +<li><a href="#_hbase_as_a_mapreduce_job_data_source_and_data_sink">49. HBase as a MapReduce Job Data Source and Data Sink</a></li> +<li><a href="#_writing_hfiles_directly_during_bulk_import">50. Writing HFiles Directly During Bulk Import</a></li> +<li><a href="#_rowcounter_example">51. RowCounter Example</a></li> +<li><a href="#splitter">52. Map-Task Splitting</a></li> +<li><a href="#mapreduce.example">53. HBase MapReduce Examples</a></li> +<li><a href="#mapreduce.htable.access">54. Accessing Other HBase Tables in a MapReduce Job</a></li> +<li><a href="#mapreduce.specex">55. Speculative Execution</a></li> +<li><a href="#cascading">56. Cascading</a></li> </ul> </li> <li><a href="#security">Securing Apache HBase</a> <ul class="sectlevel1"> -<li><a href="#_using_secure_http_https_for_the_web_ui">56. Using Secure HTTP (HTTPS) for the Web UI</a></li> -<li><a href="#hbase.secure.configuration">57. Secure Client Access to Apache HBase</a></li> -<li><a href="#hbase.secure.simpleconfiguration">58. 
Simple User Access to Apache HBase</a></li> -<li><a href="#_securing_access_to_hdfs_and_zookeeper">59. Securing Access to HDFS and ZooKeeper</a></li> -<li><a href="#_securing_access_to_your_data">60. Securing Access To Your Data</a></li> -<li><a href="#security.example.config">61. Security Configuration Example</a></li> +<li><a href="#_using_secure_http_https_for_the_web_ui">57. Using Secure HTTP (HTTPS) for the Web UI</a></li> +<li><a href="#hbase.secure.configuration">58. Secure Client Access to Apache HBase</a></li> +<li><a href="#hbase.secure.simpleconfiguration">59. Simple User Access to Apache HBase</a></li> +<li><a href="#_securing_access_to_hdfs_and_zookeeper">60. Securing Access to HDFS and ZooKeeper</a></li> +<li><a href="#_securing_access_to_your_data">61. Securing Access To Your Data</a></li> +<li><a href="#security.example.config">62. Security Configuration Example</a></li> </ul> </li> <li><a href="#_architecture">Architecture</a> <ul class="sectlevel1"> -<li><a href="#arch.overview">62. Overview</a></li> -<li><a href="#arch.catalog">63. Catalog Tables</a></li> -<li><a href="#architecture.client">64. Client</a></li> -<li><a href="#client.filter">65. Client Request Filters</a></li> -<li><a href="#architecture.master">66. Master</a></li> -<li><a href="#regionserver.arch">67. RegionServer</a></li> -<li><a href="#regions.arch">68. Regions</a></li> -<li><a href="#arch.bulk.load">69. Bulk Loading</a></li> -<li><a href="#arch.hdfs">70. HDFS</a></li> -<li><a href="#arch.timelineconsistent.reads">71. Timeline-consistent High Available Reads</a></li> -<li><a href="#hbase_mob">72. Storing Medium-sized Objects (MOB)</a></li> +<li><a href="#arch.overview">63. Overview</a></li> +<li><a href="#arch.catalog">64. Catalog Tables</a></li> +<li><a href="#architecture.client">65. Client</a></li> +<li><a href="#client.filter">66. Client Request Filters</a></li> +<li><a href="#architecture.master">67. Master</a></li> +<li><a href="#regionserver.arch">68. 
RegionServer</a></li> +<li><a href="#regions.arch">69. Regions</a></li> +<li><a href="#arch.bulk.load">70. Bulk Loading</a></li> +<li><a href="#arch.hdfs">71. HDFS</a></li> +<li><a href="#arch.timelineconsistent.reads">72. Timeline-consistent High Available Reads</a></li> +<li><a href="#hbase_mob">73. Storing Medium-sized Objects (MOB)</a></li> </ul> </li> <li><a href="#hbase_apis">Apache HBase APIs</a> <ul class="sectlevel1"> -<li><a href="#_examples">73. Examples</a></li> +<li><a href="#_examples">74. Examples</a></li> </ul> </li> <li><a href="#external_apis">Apache HBase External APIs</a> <ul class="sectlevel1"> -<li><a href="#_rest">74. REST</a></li> -<li><a href="#_thrift">75. Thrift</a></li> -<li><a href="#c">76. C/C++ Apache HBase Client</a></li> -<li><a href="#jdo">77. Using Java Data Objects (JDO) with HBase</a></li> -<li><a href="#scala">78. Scala</a></li> -<li><a href="#jython">79. Jython</a></li> +<li><a href="#_rest">75. REST</a></li> +<li><a href="#_thrift">76. Thrift</a></li> +<li><a href="#c">77. C/C++ Apache HBase Client</a></li> +<li><a href="#jdo">78. Using Java Data Objects (JDO) with HBase</a></li> +<li><a href="#scala">79. Scala</a></li> +<li><a href="#jython">80. Jython</a></li> </ul> </li> <li><a href="#thrift">Thrift API and Filter Language</a> <ul class="sectlevel1"> -<li><a href="#thrift.filter_language">80. Filter Language</a></li> +<li><a href="#thrift.filter_language">81. Filter Language</a></li> </ul> </li> <li><a href="#spark">HBase and Spark</a> <ul class="sectlevel1"> -<li><a href="#_basic_spark">81. Basic Spark</a></li> -<li><a href="#_spark_streaming">82. Spark Streaming</a></li> -<li><a href="#_bulk_load">83. Bulk Load</a></li> -<li><a href="#_sparksql_dataframes">84. SparkSQL/DataFrames</a></li> +<li><a href="#_basic_spark">82. Basic Spark</a></li> +<li><a href="#_spark_streaming">83. Spark Streaming</a></li> +<li><a href="#_bulk_load">84. Bulk Load</a></li> +<li><a href="#_sparksql_dataframes">85. 
SparkSQL/DataFrames</a></li> </ul> </li> <li><a href="#cp">Apache HBase Coprocessors</a> <ul class="sectlevel1"> -<li><a href="#_coprocessor_overview">85. Coprocessor Overview</a></li> -<li><a href="#_types_of_coprocessors">86. Types of Coprocessors</a></li> -<li><a href="#cp_loading">87. Loading Coprocessors</a></li> -<li><a href="#cp_example">88. Examples</a></li> -<li><a href="#_guidelines_for_deploying_a_coprocessor">89. Guidelines For Deploying A Coprocessor</a></li> -<li><a href="#_monitor_time_spent_in_coprocessors">90. Monitor Time Spent in Coprocessors</a></li> +<li><a href="#_coprocessor_overview">86. Coprocessor Overview</a></li> +<li><a href="#_types_of_coprocessors">87. Types of Coprocessors</a></li> +<li><a href="#cp_loading">88. Loading Coprocessors</a></li> +<li><a href="#cp_example">89. Examples</a></li> +<li><a href="#_guidelines_for_deploying_a_coprocessor">90. Guidelines For Deploying A Coprocessor</a></li> +<li><a href="#_monitor_time_spent_in_coprocessors">91. Monitor Time Spent in Coprocessors</a></li> </ul> </li> <li><a href="#performance">Apache HBase Performance Tuning</a> <ul class="sectlevel1"> -<li><a href="#perf.os">91. Operating System</a></li> -<li><a href="#perf.network">92. Network</a></li> -<li><a href="#jvm">93. Java</a></li> -<li><a href="#perf.configurations">94. HBase Configurations</a></li> -<li><a href="#perf.zookeeper">95. ZooKeeper</a></li> -<li><a href="#perf.schema">96. Schema Design</a></li> -<li><a href="#perf.general">97. HBase General Patterns</a></li> -<li><a href="#perf.writing">98. Writing to HBase</a></li> -<li><a href="#perf.reading">99. Reading from HBase</a></li> -<li><a href="#perf.deleting">100. Deleting from HBase</a></li> -<li><a href="#perf.hdfs">101. HDFS</a></li> -<li><a href="#perf.ec2">102. Amazon EC2</a></li> -<li><a href="#perf.hbase.mr.cluster">103. Collocating HBase and MapReduce</a></li> -<li><a href="#perf.casestudy">104. Case Studies</a></li> +<li><a href="#perf.os">92. 
Operating System</a></li> +<li><a href="#perf.network">93. Network</a></li> +<li><a href="#jvm">94. Java</a></li> +<li><a href="#perf.configurations">95. HBase Configurations</a></li> +<li><a href="#perf.zookeeper">96. ZooKeeper</a></li> +<li><a href="#perf.schema">97. Schema Design</a></li> +<li><a href="#perf.general">98. HBase General Patterns</a></li> +<li><a href="#perf.writing">99. Writing to HBase</a></li> +<li><a href="#perf.reading">100. Reading from HBase</a></li> +<li><a href="#perf.deleting">101. Deleting from HBase</a></li> +<li><a href="#perf.hdfs">102. HDFS</a></li> +<li><a href="#perf.ec2">103. Amazon EC2</a></li> +<li><a href="#perf.hbase.mr.cluster">104. Collocating HBase and MapReduce</a></li> +<li><a href="#perf.casestudy">105. Case Studies</a></li> </ul> </li> <li><a href="#trouble">Troubleshooting and Debugging Apache HBase</a> <ul class="sectlevel1"> -<li><a href="#trouble.general">105. General Guidelines</a></li> -<li><a href="#trouble.log">106. Logs</a></li> -<li><a href="#trouble.resources">107. Resources</a></li> -<li><a href="#trouble.tools">108. Tools</a></li> -<li><a href="#trouble.client">109. Client</a></li> -<li><a href="#trouble.mapreduce">110. MapReduce</a></li> -<li><a href="#trouble.namenode">111. NameNode</a></li> -<li><a href="#trouble.network">112. Network</a></li> -<li><a href="#trouble.rs">113. RegionServer</a></li> -<li><a href="#trouble.master">114. Master</a></li> -<li><a href="#trouble.zookeeper">115. ZooKeeper</a></li> -<li><a href="#trouble.ec2">116. Amazon EC2</a></li> -<li><a href="#trouble.versions">117. HBase and Hadoop version issues</a></li> -<li><a href="#_ipc_configuration_conflicts_with_hadoop">118. IPC Configuration Conflicts with Hadoop</a></li> -<li><a href="#_hbase_and_hdfs">119. HBase and HDFS</a></li> -<li><a href="#trouble.tests">120. Running unit or integration tests</a></li> -<li><a href="#trouble.casestudy">121. Case Studies</a></li> -<li><a href="#trouble.crypto">122. 
Cryptographic Features</a></li> -<li><a href="#_operating_system_specific_issues">123. Operating System Specific Issues</a></li> -<li><a href="#_jdk_issues">124. JDK Issues</a></li> +<li><a href="#trouble.general">106. General Guidelines</a></li> +<li><a href="#trouble.log">107. Logs</a></li> +<li><a href="#trouble.resources">108. Resources</a></li> +<li><a href="#trouble.tools">109. Tools</a></li> +<li><a href="#trouble.client">110. Client</a></li> +<li><a href="#trouble.mapreduce">111. MapReduce</a></li> +<li><a href="#trouble.namenode">112. NameNode</a></li> +<li><a href="#trouble.network">113. Network</a></li> +<li><a href="#trouble.rs">114. RegionServer</a></li> +<li><a href="#trouble.master">115. Master</a></li> +<li><a href="#trouble.zookeeper">116. ZooKeeper</a></li> +<li><a href="#trouble.ec2">117. Amazon EC2</a></li> +<li><a href="#trouble.versions">118. HBase and Hadoop version issues</a></li> +<li><a href="#_ipc_configuration_conflicts_with_hadoop">119. IPC Configuration Conflicts with Hadoop</a></li> +<li><a href="#_hbase_and_hdfs">120. HBase and HDFS</a></li> +<li><a href="#trouble.tests">121. Running unit or integration tests</a></li> +<li><a href="#trouble.casestudy">122. Case Studies</a></li> +<li><a href="#trouble.crypto">123. Cryptographic Features</a></li> +<li><a href="#_operating_system_specific_issues">124. Operating System Specific Issues</a></li> +<li><a href="#_jdk_issues">125. JDK Issues</a></li> </ul> </li> <li><a href="#casestudies">Apache HBase Case Studies</a> <ul class="sectlevel1"> -<li><a href="#casestudies.overview">125. Overview</a></li> -<li><a href="#casestudies.schema">126. Schema Design</a></li> -<li><a href="#casestudies.perftroub">127. Performance/Troubleshooting</a></li> +<li><a href="#casestudies.overview">126. Overview</a></li> +<li><a href="#casestudies.schema">127. Schema Design</a></li> +<li><a href="#casestudies.perftroub">128. 
Performance/Troubleshooting</a></li> </ul> </li> <li><a href="#ops_mgt">Apache HBase Operational Management</a> <ul class="sectlevel1"> -<li><a href="#tools">128. HBase Tools and Utilities</a></li> -<li><a href="#ops.regionmgt">129. Region Management</a></li> -<li><a href="#node.management">130. Node Management</a></li> -<li><a href="#hbase_metrics">131. HBase Metrics</a></li> -<li><a href="#ops.monitoring">132. HBase Monitoring</a></li> -<li><a href="#_cluster_replication">133. Cluster Replication</a></li> -<li><a href="#_running_multiple_workloads_on_a_single_cluster">134. Running Multiple Workloads On a Single Cluster</a></li> -<li><a href="#ops.backup">135. HBase Backup</a></li> -<li><a href="#ops.snapshots">136. HBase Snapshots</a></li> -<li><a href="#ops.capacity">137. Capacity Planning and Region Sizing</a></li> -<li><a href="#table.rename">138. Table Rename</a></li> +<li><a href="#tools">129. HBase Tools and Utilities</a></li> +<li><a href="#ops.regionmgt">130. Region Management</a></li> +<li><a href="#node.management">131. Node Management</a></li> +<li><a href="#hbase_metrics">132. HBase Metrics</a></li> +<li><a href="#ops.monitoring">133. HBase Monitoring</a></li> +<li><a href="#_cluster_replication">134. Cluster Replication</a></li> +<li><a href="#_running_multiple_workloads_on_a_single_cluster">135. Running Multiple Workloads On a Single Cluster</a></li> +<li><a href="#ops.backup">136. HBase Backup</a></li> +<li><a href="#ops.snapshots">137. HBase Snapshots</a></li> +<li><a href="#snapshots_azure">138. Storing Snapshots in Microsoft Azure Blob Storage</a></li> +<li><a href="#ops.capacity">139. Capacity Planning and Region Sizing</a></li> +<li><a href="#table.rename">140. Table Rename</a></li> </ul> </li> <li><a href="#developer">Building and Developing Apache HBase</a> <ul class="sectlevel1"> -<li><a href="#getting.involved">139. Getting Involved</a></li> -<li><a href="#repos">140. Apache HBase Repositories</a></li> -<li><a href="#_ides">141. 
IDEs</a></li> -<li><a href="#build">142. Building Apache HBase</a></li> -<li><a href="#releasing">143. Releasing Apache HBase</a></li> -<li><a href="#hbase.rc.voting">144. Voting on Release Candidates</a></li> -<li><a href="#documentation">145. Generating the HBase Reference Guide</a></li> -<li><a href="#hbase.org">146. Updating <a href="http://hbase.apache.org">hbase.apache.org</a></a></li> -<li><a href="#hbase.tests">147. Tests</a></li> -<li><a href="#developing">148. Developer Guidelines</a></li> +<li><a href="#getting.involved">141. Getting Involved</a></li> +<li><a href="#repos">142. Apache HBase Repositories</a></li> +<li><a href="#_ides">143. IDEs</a></li> +<li><a href="#build">144. Building Apache HBase</a></li> +<li><a href="#releasing">145. Releasing Apache HBase</a></li> +<li><a href="#hbase.rc.voting">146. Voting on Release Candidates</a></li> +<li><a href="#documentation">147. Generating the HBase Reference Guide</a></li> +<li><a href="#hbase.org">148. Updating <a href="http://hbase.apache.org">hbase.apache.org</a></a></li> +<li><a href="#hbase.tests">149. Tests</a></li> +<li><a href="#developing">150. Developer Guidelines</a></li> </ul> </li> <li><a href="#unit.tests">Unit Testing HBase Applications</a> <ul class="sectlevel1"> -<li><a href="#_junit">149. JUnit</a></li> -<li><a href="#mockito">150. Mockito</a></li> -<li><a href="#_mrunit">151. MRUnit</a></li> -<li><a href="#_integration_testing_with_an_hbase_mini_cluster">152. Integration Testing with an HBase Mini-Cluster</a></li> +<li><a href="#_junit">151. JUnit</a></li> +<li><a href="#mockito">152. Mockito</a></li> +<li><a href="#_mrunit">153. MRUnit</a></li> +<li><a href="#_integration_testing_with_an_hbase_mini_cluster">154. Integration Testing with an HBase Mini-Cluster</a></li> </ul> </li> <li><a href="#zookeeper">ZooKeeper</a> <ul class="sectlevel1"> -<li><a href="#_using_existing_zookeeper_ensemble">153. Using existing ZooKeeper ensemble</a></li> -<li><a href="#zk.sasl.auth">154. 
SASL Authentication with ZooKeeper</a></li> +<li><a href="#_using_existing_zookeeper_ensemble">155. Using existing ZooKeeper ensemble</a></li> +<li><a href="#zk.sasl.auth">156. SASL Authentication with ZooKeeper</a></li> </ul> </li> <li><a href="#community">Community</a> <ul class="sectlevel1"> -<li><a href="#_decisions">155. Decisions</a></li> -<li><a href="#community.roles">156. Community Roles</a></li> -<li><a href="#hbase.commit.msg.format">157. Commit Message format</a></li> +<li><a href="#_decisions">157. Decisions</a></li> +<li><a href="#community.roles">158. Community Roles</a></li> +<li><a href="#hbase.commit.msg.format">159. Commit Message format</a></li> </ul> </li> <li><a href="#_appendix">Appendix</a> @@ -279,7 +281,7 @@ <li><a href="#hbck.in.depth">Appendix C: hbck In Depth</a></li> <li><a href="#appendix_acl_matrix">Appendix D: Access Control Matrix</a></li> <li><a href="#compression">Appendix E: Compression and Data Block Encoding In HBase</a></li> -<li><a href="#data.block.encoding.enable">158. Enable Data Block Encoding</a></li> +<li><a href="#data.block.encoding.enable">160. Enable Data Block Encoding</a></li> <li><a href="#sql">Appendix F: SQL over HBase</a></li> <li><a href="#ycsb">Appendix G: YCSB</a></li> <li><a href="#_hfile_format_2">Appendix H: HFile format</a></li> @@ -288,8 +290,8 @@ <li><a href="#asf">Appendix K: HBase and the Apache Software Foundation</a></li> <li><a href="#orca">Appendix L: Apache HBase Orca</a></li> <li><a href="#tracing">Appendix M: Enabling Dapper-like Tracing in HBase</a></li> -<li><a href="#tracing.client.modifications">159. Client Modifications</a></li> -<li><a href="#tracing.client.shell">160. Tracing from HBase Shell</a></li> +<li><a href="#tracing.client.modifications">161. Client Modifications</a></li> +<li><a href="#tracing.client.shell">162. 
Tracing from HBase Shell</a></li> <li><a href="#hbase.rpc">Appendix N: 0.95 RPC Specification</a></li> </ul> </li> @@ -5717,6 +5719,58 @@ For the full list consult the patch attached to <a href="https://issues.apache. </div> </div> </div> +<div class="sect1"> +<h2 id="amazon_s3_configuration"><a class="anchor" href="#amazon_s3_configuration"></a>11. Using Amazon S3 Storage</h2> +<div class="sectionbody"> +<div class="paragraph"> +<p>HBase is designed to be tightly coupled with HDFS, and testing of other filesystems +has not been thorough.</p> +</div> +<div class="paragraph"> +<p>The following limitations have been reported:</p> +</div> +<div class="ulist"> +<ul> +<li> +<p>RegionServers should be deployed in Amazon EC2 to mitigate latency and bandwidth +limitations when accessing the filesystem, and RegionServers must remain available +to preserve data locality.</p> +</li> +<li> +<p>S3 writes each inbound and outbound file to disk, which adds overhead to each operation.</p> +</li> +<li> +<p>The best performance is achieved when all clients and servers are in the Amazon +cloud, rather than a heterogeneous architecture.</p> +</li> +<li> +<p>You must be aware of the location of <code>hadoop.tmp.dir</code> so that the local <code>/tmp/</code> +directory is not filled to capacity.</p> +</li> +<li> +<p>HBase has a different file usage pattern than MapReduce jobs and has been optimized for +HDFS, rather than distant networked storage.</p> +</li> +<li> +<p>The <code>s3a://</code> protocol is strongly recommended. The <code>s3n://</code> and <code>s3://</code> protocols have serious +limitations and do not use the Amazon AWS SDK. The <code>s3a://</code> protocol is supported +for use with HBase if you use Hadoop 2.6.1 or higher with HBase 1.2 or higher. 
Hadoop +2.6.0 is not supported with HBase at all.</p> +</li> +</ul> +</div> +<div class="paragraph"> +<p>Configuration details for Amazon S3 and associated Amazon services such as EMR are +out of the scope of the HBase documentation. See the +<a href="https://wiki.apache.org/hadoop/AmazonS3">Hadoop Wiki entry on Amazon S3 Storage</a> +and +<a href="http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hbase.html">Amazon’s documentation for deploying HBase in EMR</a>.</p> +</div> +<div class="paragraph"> +<p>One use case that is well-suited for Amazon S3 is storing snapshots. See <a href="#snapshots_s3">Storing Snapshots in an Amazon S3 Bucket</a>.</p> +</div> +</div> +</div> <h1 id="upgrading" class="sect0"><a class="anchor" href="#upgrading"></a>Upgrading</h1> <div class="openblock partintro"> <div class="content"> @@ -5741,13 +5795,13 @@ It may be possible to skip across versions — for example go fr </div> </div> <div class="sect1"> -<h2 id="hbase.versioning"><a class="anchor" href="#hbase.versioning"></a>11. HBase version number and compatibility</h2> +<h2 id="hbase.versioning"><a class="anchor" href="#hbase.versioning"></a>12. HBase version number and compatibility</h2> <div class="sectionbody"> <div class="paragraph"> <p>HBase has two versioning schemes, pre-1.0 and post-1.0. Both are detailed below.</p> </div> <div class="sect2"> -<h3 id="hbase.versioning.post10"><a class="anchor" href="#hbase.versioning.post10"></a>11.1. Post 1.0 versions</h3> +<h3 id="hbase.versioning.post10"><a class="anchor" href="#hbase.versioning.post10"></a>12.1. Post 1.0 versions</h3> <div class="paragraph"> <p>Starting with the 1.0.0 release, HBase is working towards <a href="http://semver.org/">Semantic Versioning</a> for its release versioning. 
In summary:</p> </div> @@ -5982,7 +6036,7 @@ It may be possible to skip across versions — for example go fr </tbody> </table> <div class="sect3"> -<h4 id="hbase.client.api.surface"><a class="anchor" href="#hbase.client.api.surface"></a>11.1.1. HBase API Surface</h4> +<h4 id="hbase.client.api.surface"><a class="anchor" href="#hbase.client.api.surface"></a>12.1.1. HBase API Surface</h4> <div class="paragraph"> <p>HBase has a lot of API points, but for the compatibility matrix above, we differentiate between Client API, Limited Private API, and Private API. HBase uses a version of <a href="https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html">Hadoop’s Interface classification</a>. HBase’s Interface classification classes can be found <a href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/classification/package-summary.html">here</a>.</p> </div> @@ -6023,7 +6077,7 @@ It may be possible to skip across versions — for example go fr </div> </div> <div class="sect2"> -<h3 id="hbase.versioning.pre10"><a class="anchor" href="#hbase.versioning.pre10"></a>11.2. Pre 1.0 versions</h3> +<h3 id="hbase.versioning.pre10"><a class="anchor" href="#hbase.versioning.pre10"></a>12.2. Pre 1.0 versions</h3> <div class="paragraph"> <p>Before the semantic versioning scheme pre-1.0, HBase tracked either Hadoop’s versions (0.2x) or 0.9x versions. If you are into the arcane, checkout our old wiki page on <a href="http://wiki.apache.org/hadoop/Hbase/HBaseVersions">HBase Versioning</a> which tries to connect the HBase version dots. Below sections cover ONLY the releases before 1.0.</p> </div> @@ -6040,7 +6094,7 @@ It may be possible to skip across versions — for example go fr </div> </div> <div class="sect2"> -<h3 id="hbase.rolling.upgrade"><a class="anchor" href="#hbase.rolling.upgrade"></a>11.3. Rolling Upgrades</h3> +<h3 id="hbase.rolling.upgrade"><a class="anchor" href="#hbase.rolling.upgrade"></a>12.3. 
Rolling Upgrades</h3> <div class="paragraph"> <p>A rolling upgrade is the process by which you update the servers in your cluster a server at a time. You can rolling upgrade across HBase versions if they are binary or wire compatible. See <a href="#hbase.rolling.restart">Rolling Upgrade Between Versions that are Binary/Wire Compatible</a> for more on what this means. Coarsely, a rolling upgrade is a graceful stop each server, update the software, and then restart. You do this for each server in the cluster. Usually you upgrade the Master first and then the RegionServers. See <a href="#rolling">Rolling Restart</a> for tools that can help use the rolling upgrade process.</p> </div> @@ -6066,15 +6120,15 @@ It may be possible to skip across versions — for example go fr </div> </div> <div class="sect1"> -<h2 id="_upgrade_paths"><a class="anchor" href="#_upgrade_paths"></a>12. Upgrade Paths</h2> +<h2 id="_upgrade_paths"><a class="anchor" href="#_upgrade_paths"></a>13. Upgrade Paths</h2> <div class="sectionbody"> <div class="sect2"> -<h3 id="upgrade1.0"><a class="anchor" href="#upgrade1.0"></a>12.1. Upgrading from 0.98.x to 1.0.x</h3> +<h3 id="upgrade1.0"><a class="anchor" href="#upgrade1.0"></a>13.1. Upgrading from 0.98.x to 1.0.x</h3> <div class="paragraph"> <p>In this section we first note the significant changes that come in with 1.0.0 HBase and then we go over the upgrade process. Be sure to read the significant changes section with care so you avoid surprises.</p> </div> <div class="sect3"> -<h4 id="_changes_of_note"><a class="anchor" href="#_changes_of_note"></a>12.1.1. Changes of Note!</h4> +<h4 id="_changes_of_note"><a class="anchor" href="#_changes_of_note"></a>13.1.1. 
Changes of Note!</h4> <div class="paragraph"> <p>In here we list important changes that are in 1.0.0 since 0.98.x., changes you should be aware that will go into effect once you upgrade.</p> </div> @@ -6116,7 +6170,7 @@ using 0.98.11 servers with any other client version.</p> </div> </div> <div class="sect3"> -<h4 id="upgrade1.0.rolling.upgrade"><a class="anchor" href="#upgrade1.0.rolling.upgrade"></a>12.1.2. Rolling upgrade from 0.98.x to HBase 1.0.0</h4> +<h4 id="upgrade1.0.rolling.upgrade"><a class="anchor" href="#upgrade1.0.rolling.upgrade"></a>13.1.2. Rolling upgrade from 0.98.x to HBase 1.0.0</h4> <div class="admonitionblock note"> <table> <tr> @@ -6135,14 +6189,14 @@ You cannot do a <a href="#hbase.rolling.upgrade">rolling upgrade</a> from 0.96.x </div> </div> <div class="sect3"> -<h4 id="upgrade1.0.from.0.94"><a class="anchor" href="#upgrade1.0.from.0.94"></a>12.1.3. Upgrading to 1.0 from 0.94</h4> +<h4 id="upgrade1.0.from.0.94"><a class="anchor" href="#upgrade1.0.from.0.94"></a>13.1.3. Upgrading to 1.0 from 0.94</h4> <div class="paragraph"> <p>You cannot rolling upgrade from 0.94.x to 1.x.x. You must stop your cluster, install the 1.x.x software, run the migration described at <a href="#executing.the.0.96.upgrade">Executing the 0.96 Upgrade</a> (substituting 1.x.x. wherever we make mention of 0.96.x in the section below), and then restart. Be sure to upgrade your ZooKeeper if it is a version less than the required 3.4.x.</p> </div> </div> </div> <div class="sect2"> -<h3 id="upgrade0.98"><a class="anchor" href="#upgrade0.98"></a>12.2. Upgrading from 0.96.x to 0.98.x</h3> +<h3 id="upgrade0.98"><a class="anchor" href="#upgrade0.98"></a>13.2. Upgrading from 0.96.x to 0.98.x</h3> <div class="paragraph"> <p>A rolling upgrade from 0.96.x to 0.98.x works. 
The two versions are not binary compatible.</p> </div> @@ -6154,15 +6208,15 @@ You cannot do a <a href="#hbase.rolling.upgrade">rolling upgrade</a> from 0.96.x </div> </div> <div class="sect2"> -<h3 id="_upgrading_from_0_94_x_to_0_98_x"><a class="anchor" href="#_upgrading_from_0_94_x_to_0_98_x"></a>12.3. Upgrading from 0.94.x to 0.98.x</h3> +<h3 id="_upgrading_from_0_94_x_to_0_98_x"><a class="anchor" href="#_upgrading_from_0_94_x_to_0_98_x"></a>13.3. Upgrading from 0.94.x to 0.98.x</h3> <div class="paragraph"> <p>A rolling upgrade from 0.94.x directly to 0.98.x does not work. The upgrade path follows the same procedures as <a href="#upgrade0.96">Upgrading from 0.94.x to 0.96.x</a>. Additional steps are required to use some of the new features of 0.98.x. See <a href="#upgrade0.98">Upgrading from 0.96.x to 0.98.x</a> for an abbreviated list of these features.</p> </div> </div> <div class="sect2"> -<h3 id="upgrade0.96"><a class="anchor" href="#upgrade0.96"></a>12.4. Upgrading from 0.94.x to 0.96.x</h3> +<h3 id="upgrade0.96"><a class="anchor" href="#upgrade0.96"></a>13.4. Upgrading from 0.94.x to 0.96.x</h3> <div class="sect3"> -<h4 id="_the_singularity"><a class="anchor" href="#_the_singularity"></a>12.4.1. The "Singularity"</h4> +<h4 id="_the_singularity"><a class="anchor" href="#_the_singularity"></a>13.4.1. The "Singularity"</h4> <div class="admonitionblock note"> <table> <tr> @@ -6184,7 +6238,7 @@ Do not deploy 0.96.x Deploy at least 0.98.x. See <a href="https://issues.apache </div> </div> <div class="sect3"> -<h4 id="executing.the.0.96.upgrade"><a class="anchor" href="#executing.the.0.96.upgrade"></a>12.4.2. Executing the 0.96 Upgrade</h4> +<h4 id="executing.the.0.96.upgrade"><a class="anchor" href="#executing.the.0.96.upgrade"></a>13.4.2. 
Executing the 0.96 Upgrade</h4> <div class="admonitionblock note"> <table> <tr> @@ -6349,7 +6403,7 @@ Successfully completed Log splitting</pre> </div> </div> <div class="sect2"> -<h3 id="s096.migration.troubleshooting"><a class="anchor" href="#s096.migration.troubleshooting"></a>12.5. Troubleshooting</h3> +<h3 id="s096.migration.troubleshooting"><a class="anchor" href="#s096.migration.troubleshooting"></a>13.5. Troubleshooting</h3> <div id="s096.migration.troubleshooting.old.client" class="paragraph"> <div class="title">Old Client connecting to 0.96 cluster</div> <p>It will fail with an exception like the below. Upgrade.</p> @@ -6371,7 +6425,7 @@ Successfully completed Log splitting</pre> </div> </div> <div class="sect3"> -<h4 id="_upgrading_code_meta_code_to_use_protocol_buffers_protobuf"><a class="anchor" href="#_upgrading_code_meta_code_to_use_protocol_buffers_protobuf"></a>12.5.1. Upgrading <code>META</code> to use Protocol Buffers (Protobuf)</h4> +<h4 id="_upgrading_code_meta_code_to_use_protocol_buffers_protobuf"><a class="anchor" href="#_upgrading_code_meta_code_to_use_protocol_buffers_protobuf"></a>13.5.1. Upgrading <code>META</code> to use Protocol Buffers (Protobuf)</h4> <div class="paragraph"> <p>When you upgrade from versions prior to 0.96, <code>META</code> needs to be converted to use protocol buffers. This is controlled by the configuration option <code>hbase.MetaMigrationConvertingToPB</code>, which is set to <code>true</code> by default. Therefore, by default, no action is required on your part.</p> </div> @@ -6381,15 +6435,15 @@ Successfully completed Log splitting</pre> </div> </div> <div class="sect2"> -<h3 id="upgrade0.94"><a class="anchor" href="#upgrade0.94"></a>12.6. Upgrading from 0.92.x to 0.94.x</h3> +<h3 id="upgrade0.94"><a class="anchor" href="#upgrade0.94"></a>13.6. 
Upgrading from 0.92.x to 0.94.x</h3> <div class="paragraph"> <p>We used to think that 0.92 and 0.94 were interface compatible and that you could do a rolling upgrade between these versions, but then we found that <a href="https://issues.apache.org/jira/browse/HBASE-5357">HBASE-5357 Use builder pattern in HColumnDescriptor</a> changed method signatures so that, rather than returning <code>void</code>, they now return <code>HColumnDescriptor</code>. This will throw <code>java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)V</code>, so 0.92 and 0.94 are NOT compatible. You cannot do a rolling upgrade between them.</p> </div> </div> <div class="sect2"> -<h3 id="upgrade0.92"><a class="anchor" href="#upgrade0.92"></a>12.7. Upgrading from 0.90.x to 0.92.x</h3> +<h3 id="upgrade0.92"><a class="anchor" href="#upgrade0.92"></a>13.7. Upgrading from 0.90.x to 0.92.x</h3> <div class="sect3"> -<h4 id="_upgrade_guide"><a class="anchor" href="#_upgrade_guide"></a>12.7.1. Upgrade Guide</h4> +<h4 id="_upgrade_guide"><a class="anchor" href="#_upgrade_guide"></a>13.7.1. Upgrade Guide</h4> <div class="paragraph"> <p>You will find that 0.92.0 runs a little differently to 0.90.x releases. Here are a few things to watch out for when upgrading from 0.90.x to 0.92.0.</p> </div> @@ -6479,7 +6533,7 @@ Successfully completed Log splitting</pre> </div> </div> <div class="sect2"> -<h3 id="upgrade0.90"><a class="anchor" href="#upgrade0.90"></a>12.8. Upgrading to HBase 0.90.x from 0.20.x or 0.89.x</h3> +<h3 id="upgrade0.90"><a class="anchor" href="#upgrade0.90"></a>13.8. Upgrading to HBase 0.90.x from 0.20.x or 0.89.x</h3> <div class="paragraph"> <p>This version of 0.90.x HBase can be started on data written by HBase 0.20.x or HBase 0.89.x. There is no need for a migration step. 
HBase 0.89.x and 0.90.x do write out the names of region directories differently (they name them with an MD5 hash of the region name rather than a Jenkins hash), so once started, there is no going back to HBase 0.20.x.</p> </div> @@ -6529,7 +6583,7 @@ Browse at least the paragraphs at the end of the help output for the gist of how </div> </div> <div class="sect1"> -<h2 id="scripting"><a class="anchor" href="#scripting"></a>13. Scripting with Ruby</h2> +<h2 id="scripting"><a class="anchor" href="#scripting"></a>14. Scripting with Ruby</h2> <div class="sectionbody"> <div class="paragraph"> <p>For examples of scripting Apache HBase, look in the HBase <em>bin</em> directory. @@ -6544,7 +6598,7 @@ To run one of these files, do as follows:</p> </div> </div> <div class="sect1"> -<h2 id="_running_the_shell_in_non_interactive_mode"><a class="anchor" href="#_running_the_shell_in_non_interactive_mode"></a>14. Running the Shell in Non-Interactive Mode</h2> +<h2 id="_running_the_shell_in_non_interactive_mode"><a class="anchor" href="#_running_the_shell_in_non_interactive_mode"></a>15. Running the Shell in Non-Interactive Mode</h2> <div class="sectionbody"> <div class="paragraph"> <p>A new non-interactive mode has been added to the HBase Shell (<a href="https://issues.apache.org/jira/browse/HBASE-11658">HBASE-11658</a>). @@ -6557,7 +6611,7 @@ If you use the normal interactive mode, the HBase Shell will only ever return it </div> </div> <div class="sect1"> -<h2 id="hbase.shell.noninteractive"><a class="anchor" href="#hbase.shell.noninteractive"></a>15. HBase Shell in OS Scripts</h2> +<h2 id="hbase.shell.noninteractive"><a class="anchor" href="#hbase.shell.noninteractive"></a>16. HBase Shell in OS Scripts</h2> <div class="sectionbody"> <div class="paragraph"> <p>You can use the HBase shell from within operating system script interpreters such as the Bash shell, which is the default command interpreter for most Linux and UNIX distributions. 
@@ -6641,7 +6695,7 @@ return $status</code></pre> </div> </div> <div class="sect2"> -<h3 id="_checking_for_success_or_failure_in_scripts"><a class="anchor" href="#_checking_for_success_or_failure_in_scripts"></a>15.1. Checking for Success or Failure In Scripts</h3> +<h3 id="_checking_for_success_or_failure_in_scripts"><a class="anchor" href="#_checking_for_success_or_failure_in_scripts"></a>16.1. Checking for Success or Failure In Scripts</h3> <div class="paragraph"> <p>Getting an exit code of <code>0</code> means that the command you scripted definitely succeeded. However, getting a non-zero exit code does not necessarily mean the command failed. @@ -6654,7 +6708,7 @@ For instance, if your script creates a table, but returns a non-zero exit value, </div> </div> <div class="sect1"> -<h2 id="_read_hbase_shell_commands_from_a_command_file"><a class="anchor" href="#_read_hbase_shell_commands_from_a_command_file"></a>16. Read HBase Shell Commands from a Command File</h2> +<h2 id="_read_hbase_shell_commands_from_a_command_file"><a class="anchor" href="#_read_hbase_shell_commands_from_a_command_file"></a>17. Read HBase Shell Commands from a Command File</h2> <div class="sectionbody"> <div class="paragraph"> <p>You can enter HBase Shell commands into a text file, one command per line, and pass that file to the HBase Shell.</p> @@ -6726,7 +6780,7 @@ COLUMN CELL </div> </div> <div class="sect1"> -<h2 id="_passing_vm_options_to_the_shell"><a class="anchor" href="#_passing_vm_options_to_the_shell"></a>17. Passing VM Options to the Shell</h2> +<h2 id="_passing_vm_options_to_the_shell"><a class="anchor" href="#_passing_vm_options_to_the_shell"></a>18. Passing VM Options to the Shell</h2> <div class="sectionbody"> <div class="paragraph"> <p>You can pass VM options to the HBase Shell using the <code>HBASE_SHELL_OPTS</code> environment variable. 
@@ -6743,10 +6797,10 @@ The command should be run all on a single line, but is broken by the <code>\</co </div> </div> <div class="sect1"> -<h2 id="_shell_tricks"><a class="anchor" href="#_shell_tricks"></a>18. Shell Tricks</h2> +<h2 id="_shell_tricks"><a class="anchor" href="#_shell_tricks"></a>19. Shell Tricks</h2> <div class="sectionbody"> <div class="sect2"> -<h3 id="_table_variables"><a class="anchor" href="#_table_variables"></a>18.1. Table variables</h3> +<h3 id="_table_variables"><a class="anchor" href="#_table_variables"></a>19.1. Table variables</h3> <div class="paragraph"> <p>HBase 0.95 adds shell commands that provide jruby-style object-oriented references for tables. Previously, all of the shell commands that act upon a table had a procedural style that always took the name of the table as an argument. @@ -6857,7 +6911,7 @@ hbase(main):018:0></pre> </div> </div> <div class="sect2"> -<h3 id="__em_irbrc_em"><a class="anchor" href="#__em_irbrc_em"></a>18.2. <em>irbrc</em></h3> +<h3 id="__em_irbrc_em"><a class="anchor" href="#__em_irbrc_em"></a>19.2. <em>irbrc</em></h3> <div class="paragraph"> <p>Create an <em>.irbrc</em> file for yourself in your home directory. Add customizations. @@ -6876,7 +6930,7 @@ IRB.conf[:HISTORY_FILE] = "#{ENV['HOME']}/.irb-save-history"</code></p </div> </div> <div class="sect2"> -<h3 id="_log_data_to_timestamp"><a class="anchor" href="#_log_data_to_timestamp"></a>18.3. LOG data to timestamp</h3> +<h3 id="_log_data_to_timestamp"><a class="anchor" href="#_log_data_to_timestamp"></a>19.3. LOG data to timestamp</h3> <div class="paragraph"> <p>To convert the date '08/08/16 20:56:29' from an hbase log into a timestamp, do:</p> </div> @@ -6901,9 +6955,9 @@ hbase(main):022:0> Date.new(1218920189000).toString() => "Sat Aug 16 20:56 </div> </div> <div class="sect2"> -<h3 id="_debug"><a class="anchor" href="#_debug"></a>18.4. Debug</h3> +<h3 id="_debug"><a class="anchor" href="#_debug"></a>19.4. 
Debug</h3> <div class="sect3"> -<h4 id="_shell_debug_switch"><a class="anchor" href="#_shell_debug_switch"></a>18.4.1. Shell debug switch</h4> +<h4 id="_shell_debug_switch"><a class="anchor" href="#_shell_debug_switch"></a>19.4.1. Shell debug switch</h4> <div class="paragraph"> <p>You can set a debug switch in the shell to see more output — e.g. more of the stack trace on exception — when you run a command:</p> @@ -6915,7 +6969,7 @@ more of the stack trace on exception — when you run a command: </div> </div> <div class="sect3"> -<h4 id="_debug_log_level"><a class="anchor" href="#_debug_log_level"></a>18.4.2. DEBUG log level</h4> +<h4 id="_debug_log_level"><a class="anchor" href="#_debug_log_level"></a>19.4.2. DEBUG log level</h4> <div class="paragraph"> <p>To enable DEBUG level logging in the shell, launch it with the <code>-d</code> option.</p> </div> @@ -6927,9 +6981,9 @@ more of the stack trace on exception — when you run a command: </div> </div> <div class="sect2"> -<h3 id="_commands"><a class="anchor" href="#_commands"></a>18.5. Commands</h3> +<h3 id="_commands"><a class="anchor" href="#_commands"></a>19.5. Commands</h3> <div class="sect3"> -<h4 id="_count"><a class="anchor" href="#_count"></a>18.5.1. count</h4> +<h4 id="_count"><a class="anchor" href="#_count"></a>19.5.1. count</h4> <div class="paragraph"> <p>Count command returns the number of rows in a table. It’s quite fast when configured with the right CACHE</p> @@ -7002,7 +7056,7 @@ By default, the timestamp represents the time on the RegionServer when the data </div> </div> <div class="sect1"> -<h2 id="conceptual.view"><a class="anchor" href="#conceptual.view"></a>19. Conceptual View</h2> +<h2 id="conceptual.view"><a class="anchor" href="#conceptual.view"></a>20. 
Conceptual View</h2> <div class="sectionbody"> <div class="paragraph"> <p>You can read a very understandable explanation of the HBase data model in the blog post <a href="http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable">Understanding HBase and BigTable</a> by Jim R. Wilson. @@ -7129,7 +7183,7 @@ This is only a mock-up for illustrative purposes and may not be strictly accurat </div> </div> <div class="sect1"> -<h2 id="physical.view"><a class="anchor" href="#physical.view"></a>20. Physical View</h2> +<h2 id="physical.view"><a class="anchor" href="#physical.view"></a>21. Physical View</h2> <div class="sectionbody"> <div class="paragraph"> <p>Although at a conceptual level tables may be viewed as a sparse set of rows, they are physically stored by column family. @@ -7208,7 +7262,7 @@ Thus a request for the values of all columns in the row <code>com.cnn.www</code> </div> </div> <div class="sect1"> -<h2 id="_namespace"><a class="anchor" href="#_namespace"></a>21. Namespace</h2> +<h2 id="_namespace"><a class="anchor" href="#_namespace"></a>22. Namespace</h2> <div class="sectionbody"> <div class="paragraph"> <p>A namespace is a logical grouping of tables analogous to a database in relational database systems. @@ -7228,7 +7282,7 @@ This abstraction lays the groundwork for upcoming multi-tenancy related features </ul> </div> <div class="sect2"> -<h3 id="namespace_creation"><a class="anchor" href="#namespace_creation"></a>21.1. Namespace management</h3> +<h3 id="namespace_creation"><a class="anchor" href="#namespace_creation"></a>22.1. Namespace management</h3> <div class="paragraph"> <p>A namespace can be created, removed or altered. 
Namespace membership is determined during table creation by specifying a fully-qualified table name of the form:</p> @@ -7269,7 +7323,7 @@ alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VA </div> </div> <div class="sect2"> -<h3 id="namespace_special"><a class="anchor" href="#namespace_special"></a>21.2. Predefined namespaces</h3> +<h3 id="namespace_special"><a class="anchor" href="#namespace_special"></a>22.2. Predefined namespaces</h3> <div class="paragraph"> <p>There are two predefined special namespaces:</p> </div> @@ -7301,7 +7355,7 @@ create 'bar', 'fam'</code></pre> </div> </div> <div class="sect1"> -<h2 id="_table"><a class="anchor" href="#_table"></a>22. Table</h2> +<h2 id="_table"><a class="anchor" href="#_table"></a>23. Table</h2> <div class="sectionbody"> <div class="paragraph"> <p>Tables are declared up front at schema definition time.</p> @@ -7309,7 +7363,7 @@ create 'bar', 'fam'</code></pre> </div> </div> <div class="sect1"> -<h2 id="_row"><a class="anchor" href="#_row"></a>23. Row</h2> +<h2 id="_row"><a class="anchor" href="#_row"></a>24. Row</h2> <div class="sectionbody"> <div class="paragraph"> <p>Row keys are uninterpreted bytes. @@ -7319,7 +7373,7 @@ The empty byte array is used to denote both the start and end of a tables' names </div> </div> <div class="sect1"> -<h2 id="columnfamily"><a class="anchor" href="#columnfamily"></a>24. Column Family</h2> +<h2 id="columnfamily"><a class="anchor" href="#columnfamily"></a>25. Column Family</h2> <div class="sectionbody"> <div class="paragraph"> <p>Columns in Apache HBase are grouped into <em>column families</em>. @@ -7337,7 +7391,7 @@ Because tunings and storage specifications are done at the column family level, </div> </div> <div class="sect1"> -<h2 id="_cells"><a class="anchor" href="#_cells"></a>25. Cells</h2> +<h2 id="_cells"><a class="anchor" href="#_cells"></a>26. 
Cells</h2> <div class="sectionbody"> <div class="paragraph"> <p>A <em>{row, column, version}</em> tuple exactly specifies a <code>cell</code> in HBase. @@ -7346,27 +7400,27 @@ Cell content is uninterpreted bytes</p> </div> </div> <div class="sect1"> -<h2 id="_data_model_operations"><a class="anchor" href="#_data_model_operations"></a>26. Data Model Operations</h2> +<h2 id="_data_model_operations"><a class="anchor" href="#_data_model_operations"></a>27. Data Model Operations</h2> <div class="sectionbody"> <div class="paragraph"> <p>The four primary data model operations are Get, Put, Scan, and Delete. Operations are applied via <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html">Table</a> instances.</p> </div> <div class="sect2"> -<h3 id="_get"><a class="anchor" href="#_get"></a>26.1. Get</h3> +<h3 id="_get"><a class="anchor" href="#_get"></a>27.1. Get</h3> <div class="paragraph"> <p><a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</a> returns attributes for a specified row. Gets are executed via <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#get(org.apache.hadoop.hbase.client.Get)">Table.get</a>.</p> </div> </div> <div class="sect2"> -<h3 id="_put"><a class="anchor" href="#_put"></a>26.2. Put</h3> +<h3 id="_put"><a class="anchor" href="#_put"></a>27.2. Put</h3> <div class="paragraph"> <p><a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html">Put</a> either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists). 
Puts are executed via <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#put(org.apache.hadoop.hbase.client.Put)">Table.put</a> (writeBuffer) or <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#batch(java.util.List,%20java.lang.Object%5B%5D)">Table.batch</a> (non-writeBuffer).</p> </div> </div> <div class="sect2"> -<h3 id="scan"><a class="anchor" href="#scan"></a>26.3. Scans</h3> +<h3 id="scan"><a class="anchor" href="#scan"></a>27.3. Scans</h3> <div class="paragraph"> <p><a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</a> allows iteration over multiple rows for specified attributes.</p> </div> @@ -7400,7 +7454,7 @@ ResultScanner rs = table.getScanner(scan); </div> </div> <div class="sect2"> -<h3 id="_delete"><a class="anchor" href="#_delete"></a>26.4. Delete</h3> +<h3 id="_delete"><a class="anchor" href="#_delete"></a>27.4. Delete</h3> <div class="paragraph"> <p><a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Delete.html">Delete</a> removes a row from a table. Deletes are executed via <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#delete(org.apache.hadoop.hbase.client.Delete)">Table.delete</a>.</p> @@ -7416,7 +7470,7 @@ These tombstones, along with the dead values, are cleaned up on major compaction </div> </div> <div class="sect1"> -<h2 id="versions"><a class="anchor" href="#versions"></a>27. Versions</h2> +<h2 id="versions"><a class="anchor" href="#versions"></a>28. Versions</h2> <div class="sectionbody"> <div class="paragraph"> <p>A <em>{row, column, version}</em> tuple exactly specifies a <code>cell</code> in HBase. 
@@ -7451,7 +7505,7 @@ As of this writing, the limitation <em>Overwriting values at existing timestamps This section is basically a synopsis of this article by Bruno Dumon.</p> </div> <div class="sect2"> -<h3 id="specify.number.of.versions"><a class="anchor" href="#specify.number.of.versions"></a>27.1. Specifying the Number of Versions to Store</h3> +<h3 id="specify.number.of.versions"><a class="anchor" href="#specify.number.of.versions"></a>28.1. Specifying the Number of Versions to Store</h3> <div class="paragraph"> <p>The maximum number of versions to store for a given column is part of the column schema and is specified at table creation, or via an <code>alter</code> command, via <code>HColumnDescriptor.DEFAULT_VERSIONS</code>. Prior to HBase 0.96, the default number of versions kept was <code>3</code>, but in 0.96 and newer has been changed to <code>1</code>.</p> @@ -7492,12 +7546,12 @@ See <a href="#hbase.column.max.version">hbase.column.max.version</a>.</p> </div> </div> <div class="sect2"> -<h3 id="versions.ops"><a class="anchor" href="#versions.ops"></a>27.2. Versions and HBase Operations</h3> +<h3 id="versions.ops"><a class="anchor" href="#versions.ops"></a>28.2. Versions and HBase Operations</h3> <div class="paragraph"> <p>In this section we look at the behavior of the version dimension for each of the core HBase operations.</p> </div> <div class="sect3"> -<h4 id="_get_scan"><a class="anchor" href="#_get_scan"></a>27.2.1. Get/Scan</h4> +<h4 id="_get_scan"><a class="anchor" href="#_get_scan"></a>28.2.1. Get/Scan</h4> <div class="paragraph"> <p>Gets are implemented on top of Scans. 
The below discussion of <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</a> applies equally to <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scans</a>.</p> @@ -7520,7 +7574,7 @@ The below discussion of <a href="http://hbase.apache.org/apidocs/org/apache/hado </div> </div> <div class="sect3"> -<h4 id="_default_get_example"><a class="anchor" href="#_default_get_example"></a>27.2.2. Default Get Example</h4> +<h4 id="_default_get_example"><a class="anchor" href="#_default_get_example"></a>28.2.2. Default Get Example</h4> <div class="paragraph"> <p>The following Get will only retrieve the current version of the row</p> </div> @@ -7536,7 +7590,7 @@ Get get = <span class="keyword">new</span> Get(Bytes.toBytes(<span class="string </div> </div> <div class="sect3"> -<h4 id="_versioned_get_example"><a class="anchor" href="#_versioned_get_example"></a>27.2.3. Versioned Get Example</h4> +<h4 id="_versioned_get_example"><a class="anchor" href="#_versioned_get_example"></a>28.2.3. Versioned Get Example</h4> <div class="paragraph"> <p>The following Get will return the last 3 versions of the row.</p> </div> @@ -7554,7 +7608,7 @@ get.setMaxVersions(<span class="integer">3</span>); <span class="comment">// wi </div> </div> <div class="sect3"> -<h4 id="_put_2"><a class="anchor" href="#_put_2"></a>27.2.4. Put</h4> +<h4 id="_put_2"><a class="anchor" href="#_put_2"></a>28.2.4. Put</h4> <div class="paragraph"> <p>Doing a put always creates a new version of a <code>cell</code>, at a certain timestamp. By default the system uses the server’s <code>currentTimeMillis</code>, but you can specify the version (= the long integer) yourself, on a per-column level. @@ -7603,7 +7657,7 @@ Prefer using a separate timestamp attribute of the row, or have the timestamp as </div> </div> <div class="sect3"> -<h4 id="version.delete"><a class="anchor" href="#version.delete"></a>27.2.5. 
Delete</h4> +<h4 id="version.delete"><a class="anchor" href="#version.delete"></a>28.2.5. Delete</h4> <div class="paragraph"> <p>There are three different types of internal delete markers. See Lars Hofhansl’s blog for discussion of his attempt adding another, <a href="http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html">Scanning in HBase: Prefix Delete Marker</a>.</p> @@ -7662,9 +7716,9 @@ The change has been backported to HBase 0.94 and newer branches. </div> </div> <div class="sect2"> -<h3 id="_current_limitations"><a class="anchor" href="#_current_limitations"></a>27.3. Current Limitations</h3> +<h3 id="_current_limitations"><a class="anchor" href="#_current_limitations"></a>28.3. Current Limitations</h3> <div class="sect3"> -<h4 id="_deletes_mask_puts"><a class="anchor" href="#_deletes_mask_puts"></a>27.3.1. Deletes mask Puts</h4> +<h4 id="_deletes_mask_puts"><a class="anchor" href="#_deletes_mask_puts"></a>28.3.1. Deletes mask Puts</h4> <div class="paragraph"> <p>Deletes mask puts, even puts that happened after the delete was entered. See <a href="https://issues.apache.org/jira/browse/HBASE-2256">HBASE-2256</a>. @@ -7679,7 +7733,7 @@ But they can occur even if you do not care about time: just do delete and put im </div> </div> <div class="sect3"> -<h4 id="major.compactions.change.query.results"><a class="anchor" href="#major.compactions.change.query.results"></a>27.3.2. Major compactions change query results</h4> +<h4 id="major.compactions.change.query.results"><a class="anchor" href="#major.compactions.change.query.results"></a>28.3.2. Major compactions change query results</h4> <div class="paragraph"> <p><em>…​create three cell versions at t1, t2 and t3, with a maximum-versions setting of 2. 
So when getting all versions, only the values at t2 and t3 will be @@ -7692,7 +7746,7 @@ But they can occur even if you do not care about time: just do delete and put im </div> </div> <div class="sect1"> -<h2 id="dm.sort"><a class="anchor" href="#dm.sort"></a>28. Sort Order</h2> +<h2 id="dm.sort"><a class="anchor" href="#dm.sort"></a>29. Sort Order</h2> <div class="sectionbody"> <div class="paragraph"> <p>All data model operations in HBase return data in sorted order. @@ -7701,7 +7755,7 @@ First by row, then by ColumnFamily, followed by column qualifier, and finally ti </div> </div> <div class="sect1"> -<h2 id="dm.column.metadata"><a class="anchor" href="#dm.column.metadata"></a>29. Column Metadata</h2> +<h2 id="dm.column.metadata"><a class="anchor" href="#dm.column.metadata"></a>30. Column Metadata</h2> <div class="sectionbody"> <div class="paragraph"> <p>There is no store of column metadata outside of the internal KeyValue instances for a ColumnFamily. @@ -7714,7 +7768,7 @@ For more information about how HBase stores data internally, see <a href="#keyva </div> </div> <div class="sect1"> -<h2 id="joins"><a class="anchor" href="#joins"></a>30. Joins</h2> +<h2 id="joins"><a class="anchor" href="#joins"></a>31. Joins</h2> <div class="sectionbody"> <div class="paragraph"> <p>Whether HBase supports joins is a common question on the dist-list, and there is a simple answer: it doesn’t, at least not in the way that RDBMSs support them (e.g., with equi-joins or outer-joins in SQL). As has been illustrated in this chapter, the read data model operations in HBase are Get and Scan.</p> </div> @@ -7727,7 +7781,7 @@ hash-joins). So which is the best approach? It depends on what you are trying to </div> </div> <div class="sect1"> -<h2 id="_acid"><a class="anchor" href="#_acid"></a>31. ACID</h2> +<h2 id="_acid"><a class="anchor" href="#_acid"></a>32. ACID</h2> <div class="sectionbody"> <div class="paragraph"> <p>See <a href="/acid-semantics.html">ACID Semantics</a>. 
@@ -7756,7 +7810,7 @@ modeling on HBase.</p> </div> </div> <div class="sect1"> -<h2 id="schema.creation"><a class="anchor" href="#schema.creation"></a>32. Schema Creation</h2> +<h2 id="schema.creation"><a class="anchor" href="#schema.creation"></a>33. Schema Creation</h2> <div class="sectionbody"> <div class="paragraph"> <p>HBase schemas can be created or updated using the <a href="#shell">The Apache HBase Shell</a> or by using <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html">Admin</a> in the Java API.</p> @@ -7796,7 +7850,7 @@ online schema changes are supported in the 0.92.x codebase, but the 0.90.x codeb </table> </div> <div class="sect2"> -<h3 id="schema.updates"><a class="anchor" href="#schema.updates"></a>32.1. Schema Updates</h3> +<h3 id="schema.updates"><a class="anchor" href="#schema.updates"></a>33.1. Schema Updates</h3> <div class="paragraph"> <p>When changes are made to either Tables or ColumnFamilies (e.g. region size, block size), these changes take effect the next time there is a major compaction and the StoreFiles get re-written.</p> </div> @@ -7807,7 +7861,7 @@ online schema changes are supported in the 0.92.x codebase, but the 0.90.x codeb </div> </div> <div class="sect1"> -<h2 id="table_schema_rules_of_thumb"><a class="anchor" href="#table_schema_rules_of_thumb"></a>33. Table Schema Rules Of Thumb</h2> +<h2 id="table_schema_rules_of_thumb"><a class="anchor" href="#table_schema_rules_of_thumb"></a>34. Table Schema Rules Of Thumb</h2> <div class="sectionbody"> <div class="paragraph"> <p>There are many different data sets, with different access patterns and service-level @@ -7880,7 +7934,7 @@ defaults).</p> </div> </div> <div class="sect1"> -<h2 id="number.of.cfs"><a class="anchor" href="#number.of.cfs"></a>34. On the number of column families</h2> +<h2 id="number.of.cfs"><a class="anchor" href="#number.of.cfs"></a>35. 
On the number of column families</h2> <div class="sectionbody"> <div class="paragraph"> <p>HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. @@ -7893,7 +7947,7 @@ Only introduce a second and third column family in the case where data access is you query one column family or the other but usually not both at the one time.</p> </div> <div class="sect2"> -<h3 id="number.of.cfs.card"><a class="anchor" href="#number.of.cfs.card"></a>34.1. Cardinality of ColumnFamilies</h3> +<h3 id="number.of.cfs.card"><a class="anchor" href="#number.of.cfs.card"></a>35.1. Cardinality of ColumnFamilies</h3> <div class="paragraph"> <p>Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA’s data will likely be spread across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient.</p> </div> @@ -7901,10 +7955,10 @@ you query one column family or the other but usually not both at the one time.</ </div> </div> <div class="sect1"> -<h2 id="rowkey.design"><a class="anchor" href="#rowkey.design"></a>35. Rowkey Design</h2> +<h2 id="rowkey.design"><a class="anchor" href="#rowkey.design"></a>36. Rowkey Design</h2> <div class="sectionbody"> <div class="sect2"> -<h3 id="_hotspotting"><a class="anchor" href="#_hotspotting"></a>35.1. Hotspotting</h3> +<h3 id="_hotspotting"><a class="anchor" href="#_hotspotting"></a>36.1. Hotspotting</h3> <div class="paragraph"> <p>Rows in HBase are sorted lexicographically by row key. This design optimizes for scans, allowing you to store related rows, or rows that will be read together, near each other. 
@@ -8000,7 +8054,7 @@ This effectively randomizes row keys, but sacrifices row ordering properties.</p </div> </div> <div class="sect2"> -<h3 id="timeseries"><a class="anchor" href="#timeseries"></a>35.2. Monotonically Increasing Row Keys/Timeseries Data</h3> +<h3 id="timeseries"><a class="anchor" href="#timeseries"></a>36.2. Monotonically Increasing Row Keys/Timeseries Data</h3> <div class="paragraph"> <p>In the HBase chapter of Tom White’s book <a href="http://oreilly.com/catalog/9780596521981">Hadoop: The Definitive Guide</a> (O’Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table’s regions (and thus, a single node), then moving on to the next region, etc. With monotonically increasing row-keys (i.e., using a timestamp), this will happen. @@ -8019,7 +8073,7 @@ Thus, even with a continual stream of input data with a mix of metric types, the </div> </div> <div class="sect2"> -<h3 id="keysize"><a class="anchor" href="#keysize"></a>35.3. Try to minimize row and column sizes</h3> +<h3 id="keysize"><a class="anchor" href="#keysize"></a>36.3. Try to minimize row and column sizes</h3> <div class="paragraph"> <p>In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it’ll be accompanied by its row, column name, and timestamp - always. If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios. @@ -8036,7 +8090,7 @@ Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they <p>See <a href="#keyvalue">keyvalue</a> for more information on how HBase stores data internally to see why this is important.</p> </div> <div class="sect3"> -<h4 id="keysize.cf"><a class="anchor" href="#keysize.cf"></a>35.3.1. Column Families</h4> +<h4 id="keysize.cf"><a class="anchor" href="#keysize.cf"></a>36.3.1. 
Column Families</h4> <div class="paragraph"> <p>Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).</p> </div> @@ -8045,7 +8099,7 @@ Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they </div> </div> <div class="sect3"> -<h4 id="keysize.attributes"><a class="anchor" href="#keysize.attributes"></a>35.3.2. Attributes</h4> +<h4 id="keysize.attributes"><a class="anchor" href="#keysize.attributes"></a>36.3.2. Attributes</h4> <div class="paragraph"> <p>Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via") to store in HBase.</p> </div> @@ -8054,7 +8108,7 @@ Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they </div> </div> <div class="sect3"> -<h4 id="keysize.row"><a class="anchor" href="#keysize.row"></a>35.3.3. Rowkey Length</h4> +<h4 id="keysize.row"><a class="anchor" href="#keysize.row"></a>36.3.3. Rowkey Length</h4> <div class="paragraph"> <p>Keep them as short as is reasonable such that they can still be useful for required data access (e.g. Get vs. Scan). A short key that is useless for data access is not better than a longer key with better get/scan properties. @@ -8062,7 +8116,7 @@ Expect tradeoffs when designing rowkeys.</p> </div> </div> <div class="sect3"> -<h4 id="keysize.patterns"><a class="anchor" href="#keysize.patterns"></a>35.3.4. Byte Patterns</h4> +<h4 id="keysize.patterns"><a class="anchor" href="#keysize.patterns"></a>36.3.4. Byte Patterns</h4> <div class="paragraph"> <p>A long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. @@ -8118,7 +8172,7 @@ This is the main trade-off.</p> </div> </div> <div class="sect2"> -<h3 id="reverse.timestamp"><a class="anchor" href="#reverse.timestamp"></a>35.4. Reverse Timestamps</h3> +<h3 id="reverse.timestamp"><a class="anchor" href="#reverse.timestamp"></a>36.4. 
Reverse Timestamps</h3> <div class="admonitionblock note"> <table> <tr> @@ -8150,14 +8204,14 @@ Since HBase keys are in sorted order, this key sorts before any older row-keys f </div> </div> <div class="sect2"> -<h3 id="rowkey.scope"><a class="anchor" href="#rowkey.scope"></a>35.5. Rowkeys and ColumnFamilies</h3> +<h3 id="rowkey.scope"><a class="anchor" href="#rowkey.scope"></a>36.5. Rowkeys and ColumnFamilies</h3> <div class="paragraph"> <p>Rowkeys are scoped to ColumnFamilies. Thus, the same rowkey could exist in each ColumnFamily that exists in a table without collision.</p> </div> </div> <div class="sect2"> -<h3 id="changing.rowkeys"><a class="anchor" href="#changing.rowkeys"></a>35.6. Immutability of Rowkeys</h3> +<h3 id="changing.rowkeys"><a class="anchor" href="#changing.rowkeys"></a>36.6. Immutability of Rowkeys</h3> <div class="paragraph"> <p>Rowkeys cannot be changed. The only way they can be "changed" in a table is if the row is deleted and then re-inserted. @@ -8165,7 +8219,7 @@ This is a fairly common question on the HBase dist-list so it pays to get the ro </div> </div> <div class="sect2"> -<h3 id="rowkey.regionsplits"><a class="anchor" href="#rowkey.regionsplits"></a>35.7. Relationship Between RowKeys and Region Splits</h3> +<h3 id="rowkey.regionsplits"><a class="anchor" href="#rowkey.regionsplits"></a>36.7. Relationship Between RowKeys and Region Splits</h3> <div class="paragraph"> <p>If you pre-split your table, it is <em>critical</em> to understand how your rowkey will be distributed across the region boundaries. As an example of why this is important, consider the example of using displayable hex characters as the lead position of the key (e.g., "0000000000000000" to "ffffffffffffffff"). 
Running those key ranges through <code>Bytes.split</code> (which is the split strategy used when creating regions in <code>Admin.createTable(byte[] startKey, byte[] endKey, numRegions)</code> for 10 regions will generate the following splits…​</p> @@ -8237,10 +8291,10 @@ Know your data.</p> </div> </div> <div class="sect1"> -<h2 id="schema.versions"><a class="anchor" href="#schema.versions"></a>36. Number of Versions</h2> +<h2 id="schema.versions"><a class="anchor" href="#schema.versions"></a>37. Number of Versions</h2> <div class="sectionbody"> <div class="sect2"> -<h3 id="schema.versions.max"><a class="anchor" href="#schema.versions.max"></a>36.1. Maximum Number of Versions</h3> +<h3 id="schema.versions.max"><a class="anchor" href="#schema.versions.max"></a>37.1. Maximum Number of Versions</h3> <div class="paragraph"> <p>The maximum number of row versions to store is configured per column family via <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</a>. The default for max versions is 1. @@ -8252,7 +8306,7 @@ The number of max versions may need to be increased or decreased depending on ap </div> </div> <div class="sect2"> -<h3 id="schema.minversions"><a class="anchor" href="#schema.minversions"></a>36.2. Minimum Number of Versions</h3> +<h3 id="schema.minversions"><a class="anchor" href="#schema.minversions"></a>37.2. Minimum Number of Versions</h3> <div class="paragraph"> <p>Like maximum number of row versions, the minimum number of row versions to keep is configured per column family via <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</a>. The default for min versions is 0, which means the feature is disabled. @@ -8262,7 +8316,7 @@ The minimum number of row versions parameter is used together with the time-to-l </div> </div> <div class="sect1"> -<h2 id="supported.datatypes"><a class="anchor" href="#supported.datatypes"></a>37. 
Supported Datatypes</h2> +<h2 id="supported.datatypes"><a class="anchor" href="#supported.datatypes"></a>38. Supported Datatypes</h2> <div class="sectionbody"> <div class="paragraph"> <p>HBase supports a "bytes-in/bytes-out" interface via <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html">Put</a> and <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html">Result</a>, so anything that can be converted to an array of bytes can be stored as a value. @@ -8274,7 +8328,7 @@ All rows in HBase conform to the <a href="#datamodel">Data Model</a>, and that i Take that into consideration when making your design, as well as block size for the ColumnFamily.</p> </div> <div class="sect2"> -<h3 id="_counters"><a class="anchor" href="#_counters"></a>37.1. Counters</h3> +<h3 id="_counters"><a class="anchor" href="#_counters"></a>38.1. Counters</h3> <div class="paragraph"> <p>One supported datatype that deserves special mention are "counters" (i.e., the ability to do atomic increments of numbers). See <a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#increment%28org.apache.hadoop.hbase.client.Increment%29">Increment</a> in <code>Table</code>.</p> </div> @@ -8285,7 +8339,7 @@ Take that into consideration when making your design, as well as block size for </div> </div> <div class="sect1"> -<h2 id="schema.joins"><a class="anchor" href="#schema.joins"></a>38. Joins</h2> +<h2 id="schema.joins"><a class="anchor" href="#schema.joins"></a>39. Joins</h2> <div class="sectionbody"> <div class="paragraph"> <p>If you have multiple tables, don’t forget to factor in the potential for <a href="#joins">Joins</a> into the schema design.</p> @@ -8293,7 +8347,7 @@ Take that into consideration when making your design, as well as block size for </div> </div> <div class="sect1"> -<h2 id="ttl"><a class="anchor" href="#ttl"></a>39. Time To Live (TTL)</h2> +<h2 id="ttl"><a class="anchor" href="#ttl"></a>40. 
Time To Live (TTL)</h2> <div class="sectionbody"> <div class="paragraph"> <p>ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. @@ -8328,7 +8382,7 @@ There are two notable differences between cell TTL handling and ColumnFamily TTL </div> </div> <div class="sect1"> -<h2 id="cf.keep.deleted"><a class="anchor" href="#cf.keep.deleted"></a>40. Keeping Deleted Cells</h2> +<h2 id="cf.keep.deleted"><a class="anchor" href="#cf.keep.deleted"></a>41. Keeping Deleted Cells</h2> <div class="sectionbody"> <div class="paragraph"> <p>By default, delete markers extend back to the beginning of time. @@ -8469,7 +8523,7 @@ So with KEEP_DELETED_CELLS enabled deleted cells would get removed if either you </div> </div> <div class="sect1"> -<h2 id="secondary.indexes"><a class="anchor" href="#secondary.indexes"></a>41. Secondary Indexes and Alternate Query Paths</h2> +<h2 id="secondary.indexes"><a class="anchor" href="#secondary.indexes"></a>42. Secondary Indexes and Alternate Query Paths</h2> <div class="sectionbody"> <div class="paragraph"> <p>This section could also be titled "what if my table rowkey looks like <em>this</em> but I also want to query my table like <em>that</em>." A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain time ranges. @@ -8512,7 +8566,7 @@ However, HBase scales better at larger data volumes, so this is a feature trade- <p>Additionally, see the David Butler response in this dist-list thread <a href="http://search-hadoop.com/m/nvbiBp2TDP/Stargate%252Bhbase&subj=Stargate+hbase">HBase, mail # user - Stargate+hbase</a></p> </div> <div class="sect2"> -<h3 id="secondary.indexes.filter"><a class="anchor" href="#secondary.indexes.filter"></a>41.1. Filter Query</h3> +<h3 id="secondary.indexes.filter"><a class="anchor" href="#secondary.indexes.filter"></a>42.1. 
Filter Query</h3> <div class="paragraph"> <p>Depending on the case, it may be appropriate to use <a href="#client.filter">Client Request Filters</a>. In this case, no secondary index is created. @@ -8520,7 +8574,7 @@ However, don’t try a full-scan on a large table like this from an applicat </div> </div> <div class="sect2"> -<h3 id="secondary.indexes.periodic"><a class="anchor" href="#secondary.indexes.periodic"></a>41.2. Periodic-Update Secondary Index</h3> +<h3 id="secondary.indexes.periodic"><a class="anchor" href="#secondary.indexes.periodic"></a>42.2. Periodic-Update Secondary Index</h3> <div class="paragraph"> <p>A secondary index could be created in another table which is periodically updated via a MapReduce job. The job could be executed intra-day, but depending on load-strategy it could still potentially be out of sync with the main data table.</p> @@ -8530,13 +8584,13 @@ The job could be executed intra-day, but depending on load-strategy it could sti </div> </div> <div class="sect2"> -<h3 id="secondary.indexes.dualwrite"><a class="anchor" href="#secondary.indexes.dualwrite"></a>41.3. Dual-Write Secondary Index</h3> +<h3 id="secondary.indexes.dualwrite"><a class="anchor" href="#secondary.indexes.dualwrite"></a>42.3. Dual-Write Secondary Index</h3> <div class="paragraph"> <p>Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). If this approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see <a href="#secondary.indexes.periodic">secondary.indexes.periodic</a>).</p> </div> </div> <div class="sect2"> -<h3 id="secondary.indexes.summary"><a class="anchor" href="#secondary.indexes.summary"></a>41.4. Summary Tables</h3> +<h3 id="secondary.indexes.summary"><a class="anchor" href="#secondary.indexes.summary"></a>42.4. 
Summary Tables</h3> <div class="paragraph"> <p>Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach. These would be generated with MapReduce jobs into another table.</p> @@ -8546,7 +8600,7 @@ These would be generated with MapReduce jobs into another table.</p> </div> </div> <div class="sect2"> -<h3 id="secondary.indexes.coproc"><a class="anchor" href="#secondary.indexes.coproc"></a>41.5. Coprocessor Secondary Index</h3> +<h3 id="secondary.indexes.coproc"><a class="anchor" href="#secondary.indexes.coproc"></a>42.5. Coprocessor Secondary Index</h3> <div class="paragraph"> <p>Coprocessors act like RDBMS triggers. These were added in 0.92. For more information, see <a href="#cp">coprocessors</a></p> @@ -8555,7 +8609,7 @@ For more information, see <a href="#cp">coprocessors</a></p> </div> </div> <div class="sect1"> -<h2 id="_constraints"><a class="anchor" href="#_constraints"></a>42. Constraints</h2> +<h2 id="_constraints"><a class="anchor" href="#_constraints"></a>43. Constraints</h2> <div class="sectionbody"> <div class="paragraph"> <p>HBase currently supports 'constraints' in traditional (SQL) database parlance. @@ -8570,7 +8624,7 @@ since version 0.94.</p> </div> </div> <div class="sect1"> -<h2 id="schema.casestudies"><a class="anchor" href="#schema.casestudies"></a>43. Schema Design Case Studies</h2> +<h2 id="schema.casestudies"><a class="anchor" href="#schema.casestudies"></a>44. Schema Design Case Studies</h2> <div class="sectionbody"> <div class="paragraph"> <p>The following will describe some typical data ingestion use-cases with HBase, and how the rowkey design and construction can be approached. @@ -8603,7 +8657,7 @@ Know your data, and know your processing requirements.</p> </ul> </div> <div class="sect2"> -<h3 id="schema.casestudies.log_timeseries"><a class="anchor" href="#schema.casestudies.log_timeseries"></a>43.1. 
Case Study - Log Data and Timeseries Data</h3> +<h3 id="schema.casestudies.log_timeseries"><a class="anchor" href="#schema.casestudies.log_timeseries"></a>44.1. Case Study - Log Data and Timeseries Data</h3> <div class="paragraph"> <p>Assume that the following data elements are being collected.</p> </div> @@ -8627,7 +8681,7 @@ Know your data, and know your processing requirements.</p> <p>We can store them in an HBase table called LOG_DATA, but what will the rowkey be? From these attributes the rowkey will be some combination of hostname, timestamp, and log-event - but what specifically?</p> </div> <div class="sect3"> -<h4 id="schema.casestudies.log_timeseries.tslead"><a class="anchor" href="#schema.casestudies.log_timeseries.tslead"></a>43.1.1. Timestamp In The Rowkey Lead Position</h4> +<h4 id="schema.casestudies.log_timeseries.tslead"><a class="anchor" href="#schema.casestudies.log_timeseries.tslead"></a>44.1.1. Timestamp In The Rowkey Lead Position</h4> <div class="paragraph"> <p>The rowkey <code>[timestamp][hostname][log-event]</code> suffers from the monotonically increasing rowkey problem described in <a href="#timeseries">Monotonically Increasing Row Keys/Timeseries Data</a>.</p> </div> @@ -8655,14 +8709,14 @@ Attention must be paid to the number of buckets, because this will require the s </div> </div> <div class="sect3"> -<h4 id="schema.casestudies.log_timeseries.hostlead"><a class="anchor" href="#schema.casestudies.log_timeseries.hostlead"></a>43.1.2. Host In The Rowkey Lead Position</h4> +<h4 id="schema.casestudies.log_timeseries.hostlead"><a class="anchor" href="#schema.casestudies.log_timeseries.hostlead"></a>44.1.2. Host In The Rowkey Lead Position</h4> <div class="paragraph"> <p>The rowkey <code>[hostname][log-event][timestamp]</code> is a candidate if there is a large-ish number of hosts to spread the writes and reads across the keyspace. 
This approach would be useful if scanning by hostname was a priority.</p> </div> </div> <div class="sect3"> -<h4 id="schema.casestudies.log_timeseries.revts"><a class="anchor" href="#schema.casestudies.log_timeseries.revts"></a>43.1.3. Timestamp, or Reverse Timestamp?</h4> +<h4 id="schema.casestudies.log_timeseries.revts"><a class="anchor" href="#schema.casestudies.log_timeseries.revts"></a>44.1.3. Timestamp, or Reverse Timestamp?</h4> <div class="paragraph"> <p>If the most important access path is to pull most recent events, then storing the timestamps as reverse-timestamps (e.g., <code>timestamp = Long.MAX_VALUE - timestamp</code>) will create the property of being able to do a Scan on <code>[hostname][log-event]</code> to quickly obtain the most recently captured events.</p> </div> @@ -8688,7 +8742,7 @@ See <a href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Sca </div> </div> <div class="sect3"> -<h4 id="schema.casestudies.log_timeseries.varkeys"><a class="anchor" href="#schema.casestudies.log_timeseries.varkeys"></a>43.1.4. Variable Length or Fixed Length Rowkeys?</h4> +<h4 id="schema.casestudies.log_timeseries.varkeys"><a class="anchor" href="#schema.casestudies.log_timeseries.varkeys"></a>44.1.4. Variable Length or Fixed Length Rowkeys?</h4> <div class="paragraph"> <p>It is critical to remember that rowkeys are stamped on every column in HBase. If the hostname is <code>a</code> and the event type is <code>e1</code> then the resulting rowkey would be quite small. @@ -8759,7 +8813,7 @@ by using an </div> </div> <div class="sect2"> -<h3 id="schema.casestudies.log_steroids"><a class="anchor" href="#schema.casestudies.log_steroids"></a>43.2. Case Study - Log Data and Timeseries Data on Steroids</h3> +<h3 id="schema.casestudies.log_steroids"><a class="anchor" href="#schema.casestudies.log_steroids"></a>44.2. 
Case Study - Log Data and Timeseries Data on Steroids</h3> <div class="paragraph"> <p>This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time-periods. @@ -8790,7 +8844,7 @@ from HBaseCon2012.</p> </div> </div> <div class="sect2"> -<h3 id="schema.casestudies.custorder"><a class="anchor" href="#schema.casestudies.custorder"></a>43.3. Case Study - Customer/Order</h3> +<h3 id="schema.casestudies.custorder"><a class="anchor" href="#schema.casestudies.custorder"></a>44.3. Case Study - Customer/Order</h3> <div class="paragraph"> <p>Assume that HBase is used to store customer and order information. There are two core record-types being ingested: a Customer record type, and Order record type.</p> @@ -8876,7 +8930,7 @@ What is the keyspace of the customer number, and what is the format (e.g., numer </ul> </div> <div class="sect3"> -<h4 id="schema.casestudies.custorder.tables"><a class="anchor" href="#schema.casestudies.custorder.tables"></a>43.3.1. Single Table? Multiple Tables?</h4> +<h4 id="schema.casestudies.custorder.tables"><a class="anchor" href="#schema.casestudies.custorder.tables"></a>44.3.1. Single Table? Multiple Tables?</h4> <div class="paragraph"> <p>A traditional design approach would have separate tables for CUSTOMER and SALES. Another option is to pack multiple record types into a single table (e.g., CUSTOMER++).</p> @@ -8915,7 +8969,7 @@ Another option is to pack multiple record types into a single table (e.g., CUSTO </div> </div> <div class="sect3"> -<h4 id="schema.casestudies.custorder.obj"><a class="anchor" href="#schema.casestudies.custorder.obj"></a>43.3.2. Order Object Design</h4> +<h4 id="schema.casestudies.custorder.obj"><a class="anchor" href="#schema.casestudies.custorder.obj"></a>44.3.2. Order Object Design</h4> <div class="paragraph"> <p>Now we need to address how to model the Order object. 
Assume that the class structure is as follows:</p> @@ -9121,13 +9175,13 @@ Care should be taken with this approach to ensure backward compatibility in case </div> </div> <div class="sect2"> -<h3 id="schema.smackdown"><a class="anchor" href="#schema.smackdown"></a>43.4. Case Study - "Tall/Wide/Middle" Schema Design Smackdown</h3> +<h3 id="schema.smackdown"><a class="anchor" href="#schema.smackdown"></a>44.4. Case Study - "Tall/Wide/Middle" Schema Design Smackdown</h3> <div class="paragraph"> <p>This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables. These are general guidelines and not laws - each application must consider its own needs.</p> </div> <div class="sect3"> -<h4 id="schema.smackdown.rowsversions"><a class="anchor" href="#schema.smackdown.rowsversions"></a>43.4.1. Rows vs. Versions</h4> +<h4 id="schema.smackdown.rowsversions"><a class="anchor" href="#schema.smackdown.rowsversions"></a>44.4.1. Rows vs. Versions</h4> <div class="paragraph"> <p>A common question is whether one should prefer rows or HBase’s built-in-versioning. The context is typically where there are "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 1 max versions). The rows-approach would require storing a timestamp in some portion of the rowkey so that they would not overwrite with each successive update.</p> @@ -9137,7 +9191,7 @@ The context is typically where there are "a lot" of versions of a row to be reta </div> </div> <div class="sect3"> -<h4 id="schema.smackdown.rowscols"><a class="anchor" href="#schema.smackdown.rowscols"></a>43.4.2. Rows vs. Columns</h4> +<h4 id="schema.smackdown.rowscols"><a class="anchor" href="#schema.smackdown.rowscols"></a>44.4.2. Rows vs. Columns</h4> <div class="paragraph"> <p>Another common question is whether one should prefer rows or columns. 
The context is typically in extreme cases of wide tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 column apiece.</p> @@ -9148,7 +9202,7 @@ But there is also a middle path between these two options, and that is "Rows as </div> </div> <div class="sect3"> -<h4 id="schema.smackdown.rowsascols"><a class="anchor" href="#schema.smackdown.rowsascols"></a>43.4.3. Rows as Columns</h4> +<h4 id="schema.smackdown.rowsascols"><a class="anchor" href="#schema.smackdown.rowsascols"></a>44.4.3. Rows as Columns</h4> <div class="paragraph"> <p>The middle path between Rows vs. Columns is packing data that would be a separate row into columns, for certain rows. @@ -9159,7 +9213,7 @@ For an overview of this approach, see <a href="#schema.casestudies.log_steroids" </div> </div> <div class="sect2"> -<h3 id="casestudies.schema.listdata"><a class="anchor" href="#casestudies.schema.listdata"></a>43.5. Case Study - List Data</h3> +<h3 id="casestudies.schema.listdata"><a class="anchor" href="#casestudies.schema.listdata"></a>44.5. Case Study - List Data</h3> <div class="paragraph"> <p>The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in Apache HBase.</p> </div> @@ -9274,7 +9328,7 @@ If you don’t have time to build it both ways and compare, my advice would </div> </div> <div class="sect1"> -<h2 id="schema.ops"><a class="anchor" href="#schema.ops"></a>44. Operational and Performance Configuration Options</h2> +<h2 id="schema.ops"><a class="anchor" href="#schema.ops"></a>45. 
Operational and Performance Configuration Options</h2> <div class="sectionbody"> <div class="paragraph"> <p>See the Performance section <a href="#perf.schema">perf.schema</a> for more information on operational and performance schema design options, such as Bloom Filters, Table-configured regionsizes, compression, and blocksizes.</p> @@ -9319,7 +9373,7 @@ In the notes below, we refer to o.a.h.h.mapreduce but replace with the o.a.h.h.m </div> </div> <div class="sect1"> -<h2 id="hbase.mapreduce.classpath"><a class="anchor" href="#hbase.mapreduce.classpath"></a>45. HBase, MapReduce, and the CLASSPATH</h2> +<h2 id="
<TRUNCATED>
