Hello,
We recently upgraded our staging cluster, running on S3-backed EMR, from HBase
1.3.0 to HBase 1.4.9. After doing so and running a test load, we noticed a
sudden spike in CPU utilization on one of the nodes in the cluster (jumping
from 20% to 60% all at once). After looking at the per-thread CPU time and
taking a thread dump of the application, we found exactly two threads, each
with enormous CPU usage, that had been running since the start of the spike.
They both appear to be stuck in the CellComparator compare method. We turned
off all requests to the cluster, but the CPU usage and those threads remain in
the same state. I’ve pasted some data and stack traces below and also attached
full thread dumps. We never had this issue when running HBase 1.3.
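For anyone who wants to cross-check the per-thread CPU numbers from inside the
JVM rather than from top, the standard ThreadMXBean API exposes the same
information. A minimal sketch (purely illustrative; the numbers below came from
top and a regular thread dump):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreads {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (threads.isThreadCpuTimeSupported()) {
            threads.setThreadCpuTimeEnabled(true);
        }
        // Print CPU time per thread so the busiest threads can be matched
        // against the stack traces in a full dump.
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            long cpuNanos = threads.getThreadCpuTime(id);
            if (info == null || cpuNanos < 0) {
                continue; // thread exited, or CPU timing unsupported
            }
            System.out.printf("%-70s cpu=%dms state=%s%n",
                info.getThreadName(), cpuNanos / 1_000_000L, info.getThreadState());
        }
    }
}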
What top shows:
PID    USER   PR  NI   VIRT   RES   SHR   S  %CPU   %MEM  TIME+      COMMAND
7198   hbase  20   0   7944m  2.6g  40m   S  213.0  17.9  135:39.08  java
4755   root   20   0   4329m  262m  23m   S    0.7   1.7    0:30.52  java
6630   yarn   20   0   4014m  430m  36m   S    0.3   2.9    0:48.25  java
1      root   20   0   19680  2708  2324  S    0.0   0.0    0:01.37  init
2      root   20   0   0      0     0     S    0.0   0.0    0:00.00  kthreadd
4      root    0  -20  0      0     0     I    0.0   0.0    0:00.00  kworker/0:0H
6      root    0  -20  0      0     0     I    0.0   0.0    0:00.00  mm_percpu_wq
7      root   20   0   0      0     0     S    0.0   0.0    0:01.47  ksoftirqd/0
8      root   20   0   0      0     0     I    0.0   0.0    0:02.74  rcu_sched
9      root   20   0   0      0     0     I    0.0   0.0    0:00.00  rcu_bh
10     root   RT   0   0      0     0     S    0.0   0.0    0:00.04  migration/0
11     root   RT   0   0      0     0     S    0.0   0.0    0:00.01  watchdog/0
12     root   20   0   0      0     0     S    0.0   0.0    0:00.00  cpuhp/0
13     root   20   0   0      0     0     S    0.0   0.0    0:00.00  cpuhp/1
14     root   RT   0   0      0     0     S    0.0   0.0    0:00.01  watchdog/1
15     root   RT   0   0      0     0     S    0.0   0.0    0:00.03  migration/1
16     root   20   0   0      0     0     S    0.0   0.0    0:00.03  ksoftirqd/1
18     root    0  -20  0      0     0     I    0.0   0.0    0:00.00  kworker/1:0H
A breakdown of the threads:
sudo top -n 1 -H -p 7198
top - 23:25:50 up 1:50, 2 users, load average: 2.11, 2.28, 2.27
Tasks: 232 total, 2 running, 230 sleeping, 0 stopped, 0 zombie
Cpu(s): 36.5%us, 1.0%sy, 0.0%ni, 62.0%id, 0.1%wa, 0.0%hi, 0.2%si, 0.2%st
Mem: 15394140k total, 10347008k used, 5047132k free, 60708k buffers
Swap: 0k total, 0k used, 0k free, 5358004k cached
PID    USER   PR  NI  VIRT   RES   SHR  S  %CPU  %MEM  TIME+     COMMAND
7402   hbase  20   0  7944m  2.7g  40m  R  92.8  18.7  65:12.87  java
7399   hbase  20   0  7944m  2.7g  40m  R  90.8  18.7  65:17.01  java
7257   hbase  20   0  7944m  2.7g  40m  S   7.9  18.7   4:01.49  java
7258   hbase  20   0  7944m  2.7g  40m  S   7.9  18.7   4:01.30  java
7259   hbase  20   0  7944m  2.7g  40m  S   7.9  18.7   4:01.46  java
7260   hbase  20   0  7944m  2.7g  40m  S   7.9  18.7   4:01.69  java
10443  hbase  20   0  7944m  2.7g  40m  S   2.0  18.7   0:14.43  java
7198   hbase  20   0  7944m  2.7g  40m  S   0.0  18.7   0:00.01  java
7256   hbase  20   0  7944m  2.7g  40m  S   0.0  18.7   0:06.54  java
7261   hbase  20   0  7944m  2.7g  40m  S   0.0  18.7   0:13.84  java
7262   hbase  20   0  7944m  2.7g  40m  S   0.0  18.7   0:41.45  java
7263   hbase  20   0  7944m  2.7g  40m  S   0.0  18.7   0:00.48  java
7264   hbase  20   0  7944m  2.7g  40m  S   0.0  18.7   0:00.41  java
7265   hbase  20   0  7944m  2.7g  40m  S   0.0  18.7   0:00.00  java
7266   hbase  20   0  7944m  2.7g  40m  S   0.0  18.7   0:00.00  java
7267   hbase  20   0  7944m  2.7g  40m  S   0.0  18.7   0:33.95  java
7268   hbase  20   0  7944m  2.7g  40m  S   0.0  18.7   0:32.26  java
7269   hbase  20   0  7944m  2.7g  40m  S   0.0  18.7   0:08.90  java
Stacktrace for PID 7402:
RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=16020
priority:5 - threadId:0x00007f0ec5ef7000 - nativeId:0x1cea - nativeId (decimal):7402 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hbase.CellComparator.compareColumns(CellComparator.java:158)
at org.apache.hadoop.hbase.CellComparator.compareWithoutRow(CellComparator.java:224)
at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:66)
at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
- locked <0x00000006964deaf0> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Locked ownable synchronizers:
- None
Stacktrace for PID 7399:
priority:5 - threadId:0x00007f0ec5ef1000 - nativeId:0x1ce7 - nativeId (decimal):7399 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hbase.CellComparator.compareRows(CellComparator.java:187)
at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:63)
at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
- locked <0x00000006ae08a770> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Locked ownable synchronizers:
- None
Does anyone know what is going on? Is there possibly an infinite loop or some
other mechanism causing these threads to get stuck?
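In case it helps to reproduce or narrow this down: judging by the
FilterListWithOR.getNextCellHint frames above, the scans involved go through a
FilterList with the OR (MUST_PASS_ONE) operator containing filters that supply
seek hints. A rough sketch of a scan with that shape (the table name, columns,
and concrete filter types here are placeholders, not our actual ones):

import java.util.Arrays;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.util.Bytes;

public class OrFilterScanSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("test_table"))) {
            // Two hint-providing filters combined with MUST_PASS_ONE (OR); this is
            // the combination that reaches FilterListWithOR.getNextCellHint on the
            // region server.
            FilterList orList = new FilterList(FilterList.Operator.MUST_PASS_ONE,
                Arrays.<Filter>asList(
                    new ColumnRangeFilter(Bytes.toBytes("a"), true, Bytes.toBytes("f"), false),
                    new ColumnRangeFilter(Bytes.toBytes("m"), true, Bytes.toBytes("t"), false)));
            Scan scan = new Scan().setFilter(orList);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    // consume results
                }
            }
        }
    }
}

The ColumnRangeFilter instances are only there because ColumnRangeFilter
implements getNextCellHint; any hinting filter under MUST_PASS_ONE should
exercise the same code path.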
Thanks,
Srinidhi