Junegunn Choi created HBASE-29013:
-------------------------------------
Summary: Make PerformanceEvaluation support larger data sets
Key: HBASE-29013
URL: https://issues.apache.org/jira/browse/HBASE-29013
Project: HBase
Issue Type: Improvement
Reporter: Junegunn Choi
The use of 4-byte integers in PerformanceEvaluation is limiting when you want to
test with larger data sets. For example, to generate 10TB of data with the
default value size of 1KB, you need 10G rows.
{code:java}
bin/hbase pe --nomapred --presplit=21 --compress=LZ4 --rows=10737418240 randomWrite 1
{code}
But you can't do it, because {{--rows}} expects a number that fits in a 4-byte
signed integer.
{noformat}
java.lang.NumberFormatException: For input string: "10737418240"
{noformat}
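For illustration, the limit is just the 32-bit parse (presumably {{Integer.parseInt}}); the same string parses fine as a long:
{code:java}
// Standalone illustration (class name is made up): the 32-bit parse rejects the
// value because it exceeds Integer.MAX_VALUE, while the 64-bit parse does not.
public class ParseRowsDemo {
  public static void main(String[] args) {
    System.out.println(Long.parseLong("10737418240"));   // prints 10737418240
    try {
      Integer.parseInt("10737418240");                    // larger than 2147483647
    } catch (NumberFormatException e) {
      System.out.println(e);  // For input string: "10737418240"
    }
  }
}
{code}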
We can circumvent the limitation by increasing the value size and decreasing the
number of rows instead, but I don't see a good reason to have the limitation in
the first place.
And even if we use a smaller value for {{--rows}}, we can accidentally cause
integer overflow as we increase the number of clients.
{code:java}
bin/hbase pe --nomapred --compress=LZ4 --rows=1073741824 randomWrite 20
{code}
{noformat}
2024-12-03T12:21:10,333 INFO [main {}] hbase.PerformanceEvaluation: Created 20 connections for 20 threads
2024-12-03T12:21:10,337 INFO [TestClient-5 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-1 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-3 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-4 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-7 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset -1073741824 for 1073741824 rows
2024-12-03T12:21:10,337 INFO [TestClient-8 {}] hbase.PerformanceEvaluation: Start class org.apache.hadoop.hbase.PerformanceEvaluation$RandomWriteTest at offset 0 for 1073741824 rows
...
2024-12-03T12:21:10,338 INFO [TestClient-17 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-16 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-6 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
2024-12-03T12:21:10,338 INFO [TestClient-4 {}] hbase.PerformanceEvaluation: Sampling 1 every 0 out of 1073741824 total rows.
...
java.io.IOException: java.lang.ArithmeticException: / by zero
  at org.apache.hadoop.hbase.PerformanceEvaluation.doLocalClients(PerformanceEvaluation.java:540)
  at org.apache.hadoop.hbase.PerformanceEvaluation.runTest(PerformanceEvaluation.java:2674)
  at org.apache.hadoop.hbase.PerformanceEvaluation.run(PerformanceEvaluation.java:3216)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:97)
  at org.apache.hadoop.hbase.PerformanceEvaluation.main(PerformanceEvaluation.java:3250)
{noformat}
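The negative offsets come from the per-client start offset being computed in 32-bit arithmetic; the sampling interval collapsing to 0 (and the resulting {{/ by zero}}) looks like similar int math. A minimal standalone sketch of the wrap-around with {{--rows=1073741824}}, not the actual PerformanceEvaluation code:
{code:java}
// Standalone illustration (class name is made up): with --rows=1073741824, the
// start offset of client i is effectively rows * i, which wraps around in
// 32-bit arithmetic for i >= 2.
public class OffsetOverflowDemo {
  public static void main(String[] args) {
    int perClientRows = 1073741824;
    for (int i = 0; i < 6; i++) {
      int offset = perClientRows * i;  // int multiplication overflows
      System.out.println("client " + i + " offset = " + offset);
    }
    // client 0 offset = 0
    // client 1 offset = 1073741824
    // client 2 offset = -2147483648
    // client 3 offset = -1073741824
    // client 4 offset = 0
    // client 5 offset = 1073741824
  }
}
{code}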
So I think it's best that we just use 8-byte long integers throughout the code.
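As a rough sketch of what that would look like (hypothetical code, not an actual patch), the same computations are safe once the row count is parsed and carried as a long:
{code:java}
// Hypothetical sketch of the proposed change: parse and carry row counts as long
// so neither the option parsing nor the per-client offset math can overflow.
public class LongRowsDemo {
  public static void main(String[] args) {
    long rows = Long.parseLong("10737418240");  // 10G rows parse fine as a long
    int numClients = 20;
    System.out.println("total rows = " + rows * numClients);  // 214748364800
    for (int i = 0; i < numClients; i++) {
      long offset = rows * i;  // widened to 64-bit, stays exact
      System.out.println("client " + i + " offset = " + offset);
    }
  }
}
{code}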