[ https://issues.apache.org/jira/browse/HBASE-27904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Jasani resolved HBASE-27904.
----------------------------------
    Fix Version/s:     (was: 2.6.0)
     Hadoop Flags: Reviewed
       Resolution: Fixed

> A random data generator tool leveraging bulk load.
> --------------------------------------------------
>
>                 Key: HBASE-27904
>                 URL: https://issues.apache.org/jira/browse/HBASE-27904
>             Project: HBase
>          Issue Type: New Feature
>          Components: util
>            Reporter: Himanshu Gwalani
>            Assignee: Himanshu Gwalani
>            Priority: Major
>             Fix For: 3.0.0-beta-1
>
>
> As of now, there is no data generator tool in HBase that leverages bulk load.
> Since bulk load skips the client write path, it is much faster to generate
> data this way and use it for load/performance tests where client writes are
> not mandatory.
> {*}Example{*}: Any tooling over HBase that needs x TBs of HBase table data
> for load testing.
> {*}Requirements{*}:
> 1. Tooling should generate RANDOM data on the fly and should not require any
> pre-generated data (such as CSV/XML files) as input.
> 2. Tooling should support pre-split tables (number of splits to be taken as
> input).
> 3. Data should be UNIFORMLY distributed across all regions of the table.
> *High-level Steps*
> 1. A table will be created (pre-split, with the number of splits as input).
> 2. The mapper of a custom MapReduce job will generate random key-value pairs
> and ensure that they are equally distributed across all regions of the table.
> 3.
> [HFileOutputFormat2|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java]
> will be used to add a reducer to the MR job and create HFiles from the
> key-value pairs generated by the mapper.
> 4. Bulk load those HFiles into the respective regions of the table using
> [LoadIncrementalHFiles|https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/tool/LoadIncrementalHFiles.html]
> *Results*
> We ran a POC of this tool in our organization and tested it on an 11-node
> HBase cluster (with HBase + Hadoop services running). The tool generated:
> 1. *100 GB* of data in *6 minutes*
> 2. *340 GB* of data in *13 minutes*
> 3. *3.5 TB* of data in *3 hours and 10 minutes*
> *Usage*
> hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool
> -mapper-count 100 -table TEST_TABLE_1 -rows-per-mapper 1000000 -split-count
> 100 -delete-if-exist -table-options "NORMALIZATION_ENABLED=false"



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
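
Requirement 3 and step 2 of the quoted description (uniform distribution of random rows across a pre-split table) can be illustrated with a small, HBase-free sketch. This is not the tool's actual implementation. The class name UniformRowKeySketch, the zero-padded decimal split keys, and the round-robin row key layout are all assumptions made for illustration. The one HBase fact it relies on is that N split points produce N + 1 regions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

// Hypothetical illustration only; not the code of BulkDataGeneratorTool.
final class UniformRowKeySketch {

    // Assumed split-key scheme: zero-padded decimals "00001".."000NN".
    // splitCount split points give splitCount + 1 regions.
    static List<String> splitKeys(int splitCount) {
        List<String> keys = new ArrayList<>();
        for (int i = 1; i <= splitCount; i++) {
            keys.add(String.format(Locale.ROOT, "%05d", i));
        }
        return keys;
    }

    // Row key = region prefix (chosen round-robin) + random 64-bit hex suffix.
    // The prefix pins the row to a region; the suffix makes the key random.
    static String rowKey(long rowIndex, int splitCount, Random rng) {
        long region = rowIndex % (splitCount + 1);
        return String.format(Locale.ROOT, "%05d-%016x", region, rng.nextLong());
    }

    // Region index = number of split keys that are <= the row key.
    // For ASCII keys, String.compareTo matches HBase's byte-wise ordering.
    static int regionOf(String rowKey, List<String> splitKeys) {
        int region = 0;
        for (String split : splitKeys) {
            if (rowKey.compareTo(split) >= 0) {
                region++;
            }
        }
        return region;
    }

    // Generates `rows` keys and counts how many land in each region.
    static int[] countPerRegion(int rows, int splitCount, Random rng) {
        List<String> splits = splitKeys(splitCount);
        int[] counts = new int[splitCount + 1];
        for (long i = 0; i < rows; i++) {
            counts[regionOf(rowKey(i, splitCount, rng), splits)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countPerRegion(1000, 4, new Random());
        for (int r = 0; r < counts.length; r++) {
            System.out.println("region " + r + ": " + counts[r] + " rows");
        }
    }
}
```

Because the region prefix is assigned round-robin rather than at random, per-region row counts differ by at most one. In a real job, each mapper would emit such keys as KeyValue cells for HFileOutputFormat2 to write out as HFiles.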