Junegunn Choi created HBASE-29013:
-------------------------------------
Summary: Make PerformanceEvaluation support larger data sets
Key: HBASE-29013
URL: https://issues.apache.org/jira/browse/HBASE-29013
Project: HBase
Issue Type: Improvement
Reporter: Junegunn Choi
The use of 4-byte integers in PerformanceEvaluation is limiting when you want to
test with larger data sets. For example, to generate 10TB of data with the
default value size of 1KB, you need 10G rows.
{code:java}
bin/hbase pe --nomapred --presplit=21 --compress=LZ4 --rows=10737418240 randomWrite 1
{code}
But you can't do it, because {{--rows}} expects a number that fits in a 4-byte
signed integer.
{noformat}
java.lang.NumberFormatException: For input string: "10737418240"
{noformat}
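For illustration, the limit is just the 32-bit parse (presumably {{Integer.parseInt}}); the same string parses fine as a long:
{code:java}
// Standalone illustration (class name is made up): the 32-bit parse rejects the
// value because it exceeds Integer.MAX_VALUE, while the 64-bit parse does not.
public class ParseRowsDemo {
  public static void main(String[] args) {
    System.out.println(Long.parseLong("10737418240"));   // prints 10737418240
    try {
      Integer.parseInt("10737418240");                    // larger than 2147483647
    } catch (NumberFormatException e) {
      System.out.println(e);  // For input string: "10737418240"
    }
  }
}
{code}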
We can circumvent the limitation by increasing the value size and decreasing the
number of rows instead, but I don't see a good reason to have the limitation in
the first place.
And even if we use a smaller value for {{--rows}}, we can accidentally cause
integer overflow as we increase the number of clients.
{code:java}
bin/hbase pe --nomapred --compress=LZ4 --rows=1073741824 randomWrite 20
{code}
{noformat}
2024-12-03T12:21:10,333 INFO [main {}] hbase.PerformanceEvaluation: Created 20 connections for 20 threads
2024-12-03T12:21:10,337 INFO [TestClient-5 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-1 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-3 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-4 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-7 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-8 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
...
2024-12-03T12:21:10,338 INFO [TestClient-17 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-16 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-6 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-4 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
...
java.io.IOException: java.lang.ArithmeticException: / by zero
  at org.apache.hadoop.hbase.PerformanceEvaluation.doLocalClients(PerformanceEvaluation.java:540)
  at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:2674)
  at org.apache.hadoop.hbase.PerformanceEvaluation.run(PerformanceEvaluation.java:3216)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:97)
  at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:3250)
{noformat}
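The negative offsets come from the per-client start offset being computed in 32-bit arithmetic; the sampling interval collapsing to 0 (and the resulting {{/ by zero}}) looks like similar int math. A minimal standalone sketch of the wrap-around with {{--rows=1073741824}}, not the actual PerformanceEvaluation code:
{code:java}
// Standalone illustration (class name is made up): with --rows=1073741824, the
// start offset of client i is effectively rows * i, which wraps around in
// 32-bit arithmetic for i >= 2.
public class OffsetOverflowDemo {
  public static void main(String[] args) {
    int perClientRows = 1073741824;
    for (int i = 0; i < 6; i++) {
      int offset = perClientRows * i;  // int multiplication overflows
      System.out.println("client " + i + " offset = " + offset);
    }
    // client 0 offset = 0
    // client 1 offset = 1073741824
    // client 2 offset = -2147483648
    // client 3 offset = -1073741824
    // client 4 offset = 0
    // client 5 offset = 1073741824
  }
}
{code}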
So I think it's best that we just use 8-byte long integers throughout the code.
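As a rough sketch of what that would look like (hypothetical code, not an actual patch), the same computations are safe once the row count is parsed and carried as a long:
{code:java}
// Hypothetical sketch of the proposed change: parse and carry row counts as long
// so neither the option parsing nor the per-client offset math can overflow.
public class LongRowsDemo {
  public static void main(String[] args) {
    long rows = Long.parseLong("10737418240");  // 10G rows parse fine as a long
    int numClients = 20;
    System.out.println("total rows = " + rows * numClients);  // 214748364800
    for (int i = 0; i < numClients; i++) {
      long offset = rows * i;  // widened to 64-bit, stays exact
      System.out.println("client " + i + " offset = " + offset);
    }
  }
}
{code}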