[ 
https://issues.apache.org/jira/browse/HBASE-27904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Himanshu Gwalani updated HBASE-27904:
-------------------------------------
    Description: 
As of now, there is no data generator tool in HBase that leverages bulk load. Since 
bulk load skips the client write path, it is much faster for generating data for 
load/performance tests where client writes are not a requirement.
{*}Example{*}: any tooling over HBase that needs x TBs of an HBase table for load 
testing.

{*}Requirements{*}:
1. The tooling should generate RANDOM data on the fly and should not require any 
pre-generated data (e.g., CSV/XML files) as input.
2. The tooling should support pre-split tables (the number of splits to be taken 
as input).
3. Data should be UNIFORMLY distributed across all regions of the table.
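Requirement 2 (a pre-split table) can be sketched as follows. This is an illustrative fragment, not part of the proposed patch; the class and method names are hypothetical, and a one-byte key prefix with evenly spaced split points is an assumption:

```java
// Hypothetical sketch (not the patch's code): compute evenly spaced
// split points over a one-byte row-key prefix, so that a table
// pre-split with them has numSplits regions of equal key range.
public class SplitPoints {
    static byte[][] splitPoints(int numSplits) {
        // N regions need N-1 split points.
        byte[][] splits = new byte[numSplits - 1][];
        for (int i = 1; i < numSplits; i++) {
            // Split points at multiples of 256 / numSplits.
            splits[i - 1] = new byte[] { (byte) (i * 256 / numSplits) };
        }
        return splits;
    }

    public static void main(String[] args) {
        for (byte[] s : splitPoints(4)) {
            System.out.println(s[0] & 0xFF); // 64, 128, 192
        }
    }
}
```

With split points like these, creating the table via Admin.createTable(tableDescriptor, splits) would yield numSplits regions; the actual patch may derive its split points differently.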

*High-level Steps*
1. A table will be created (pre-split, with the number of splits taken as input).
2. The mapper of a custom MapReduce job will generate random key-value pairs 
and ensure that they are equally distributed across all regions of the table.
3. 
[HFileOutputFormat2|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java]
 will be used to add a reducer to the MR job and create HFiles from the key-value 
pairs generated by the mapper.
4. Bulk load those HFiles into the respective regions of the table using 
[LoadIncrementalHFiles|https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/tool/LoadIncrementalHFiles.html].
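Step 2 hinges on the mapper emitting keys that land evenly across the pre-split regions. A minimal, self-contained sketch of that distribution property (illustrative names, not the patch's code; assumes evenly spaced one-byte split points):

```java
import java.util.Random;

// Hypothetical sketch (not the patch's code): random fixed-length row
// keys whose first byte is uniform, checked against the region index
// implied by evenly spaced one-byte split points.
public class UniformKeys {
    static byte[] randomKey(Random rnd, int keyLen) {
        byte[] key = new byte[keyLen];
        rnd.nextBytes(key); // uniform bytes => uniform first-byte prefix
        return key;
    }

    // Region a key falls into, for numSplits evenly spaced split points.
    static int regionFor(byte[] key, int numSplits) {
        return (key[0] & 0xFF) * numSplits / 256;
    }

    public static void main(String[] args) {
        int numSplits = 8;
        int[] counts = new int[numSplits];
        Random rnd = new Random(42);
        for (int i = 0; i < 80_000; i++) {
            counts[regionFor(randomKey(rnd, 16), numSplits)]++;
        }
        for (int c : counts) {
            // Each region receives roughly 80_000 / 8 = 10_000 keys.
            System.out.println(c);
        }
    }
}
```

In the real job, keys like these would flow through HFileOutputFormat2's reducer into per-region HFiles, so uniform keys translate directly into uniformly sized regions.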

*Results*
We built a POC of this tool in our organization and tested it in a test 
environment: an 11-node cluster running HBase and Hadoop services. The tool 
generated:
1. *100 GB* of data in *6 minutes*
2. *340 GB* of data in *13 minutes*
3. *3.5 TB* of data in *3 hours and 10 minutes*

 

> A random data generator tool leveraging bulk load.
> --------------------------------------------------
>
>                 Key: HBASE-27904
>                 URL: https://issues.apache.org/jira/browse/HBASE-27904
>             Project: HBase
>          Issue Type: New Feature
>          Components: util
>            Reporter: Himanshu Gwalani
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)