Maybe you were running scans with very heavy filters that simply take a
long time. According to the thread dumps, the two scan threads were not
stuck: they were RUNNABLE, so I think the scans were just taking a long
time to complete.
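For anyone doing this kind of triage: the per-thread PIDs that `top -H` prints in decimal can be matched against the hex `nativeId` fields in the thread dump by converting them, e.g. for the two hot threads reported below:

```shell
# Convert the decimal thread IDs from `top -H` to the hex nativeId
# format used in jstack / thread-dump output.
printf '0x%x\n' 7402   # hot thread #1 -> 0x1cea
printf '0x%x\n' 7399   # hot thread #2 -> 0x1ce7
```

That is how the two ~90% CPU threads in the `top -H` output line up with the two RUNNABLE scan handler threads in the dump.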

On Thu, Apr 18, 2019 at 9:06 AM Srinidhi Muppalla <[email protected]>
wrote:

> Hello,
>
>
>
> We recently upgraded our staging cluster, running on S3-backed EMR, from
> HBase 1.3.0 to HBase 1.4.9. After doing so and running a test load, we
> noticed a sudden spike in CPU utilization on one of the nodes in the
> cluster (jumping from 20% to 60% all at once). After looking at the CPU
> time of the threads and taking a thread dump of the application, we noticed
> that there were exactly two threads, each with enormous CPU usage, running
> since the start of the spike. They both appear to be stuck in the
> CellComparator compare method. We turned off all requests to the cluster,
> and the CPU usage and those threads remained in the same state. I’ve pasted
> some data and stack traces below and also attached full thread dumps. We
> didn’t have this issue when running HBase 1.3.
>
>
>
> What top shows:
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 7198 hbase     20   0 7944m 2.6g  40m S 213.0 17.9 135:39.08 java
> 4755 root      20   0 4329m 262m  23m S  0.7  1.7   0:30.52 java
> 6630 yarn      20   0 4014m 430m  36m S  0.3  2.9   0:48.25 java
>     1 root      20   0 19680 2708 2324 S  0.0  0.0   0:01.37 init
>     2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
>     4 root       0 -20     0    0    0 I  0.0  0.0   0:00.00 kworker/0:0H
>     6 root       0 -20     0    0    0 I  0.0  0.0   0:00.00 mm_percpu_wq
>     7 root      20   0     0    0    0 S  0.0  0.0   0:01.47 ksoftirqd/0
>     8 root      20   0     0    0    0 I  0.0  0.0   0:02.74 rcu_sched
>     9 root      20   0     0    0    0 I  0.0  0.0   0:00.00 rcu_bh
>    10 root      RT   0     0    0    0 S  0.0  0.0   0:00.04 migration/0
>    11 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 watchdog/0
>    12 root      20   0     0    0    0 S  0.0  0.0   0:00.00 cpuhp/0
>    13 root      20   0     0    0    0 S  0.0  0.0   0:00.00 cpuhp/1
>    14 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 watchdog/1
>    15 root      RT   0     0    0    0 S  0.0  0.0   0:00.03 migration/1
>    16 root      20   0     0    0    0 S  0.0  0.0   0:00.03 ksoftirqd/1
>    18 root       0 -20     0    0    0 I  0.0  0.0   0:00.00 kworker/1:0H
>
>
>
> A breakdown of the threads:
>
>
>
> sudo top -n 1 -H -p 7198
>
> top - 23:25:50 up  1:50,  2 users,  load average: 2.11, 2.28, 2.27
> Tasks: 232 total,   2 running, 230 sleeping,   0 stopped,   0 zombie
> Cpu(s): 36.5%us,  1.0%sy,  0.0%ni, 62.0%id,  0.1%wa,  0.0%hi,  0.2%si,  0.2%st
> Mem:  15394140k total, 10347008k used,  5047132k free,    60708k buffers
> Swap:        0k total,        0k used,        0k free,  5358004k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 7402 hbase     20   0 7944m 2.7g  40m R 92.8 18.7  65:12.87 java
> 7399 hbase     20   0 7944m 2.7g  40m R 90.8 18.7  65:17.01 java
> 7257 hbase     20   0 7944m 2.7g  40m S  7.9 18.7   4:01.49 java
> 7258 hbase     20   0 7944m 2.7g  40m S  7.9 18.7   4:01.30 java
> 7259 hbase     20   0 7944m 2.7g  40m S  7.9 18.7   4:01.46 java
> 7260 hbase     20   0 7944m 2.7g  40m S  7.9 18.7   4:01.69 java
> 10443 hbase     20   0 7944m 2.7g  40m S  2.0 18.7   0:14.43 java
> 7198 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:00.01 java
> 7256 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:06.54 java
> 7261 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:13.84 java
> 7262 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:41.45 java
> 7263 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:00.48 java
> 7264 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:00.41 java
> 7265 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:00.00 java
> 7266 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:00.00 java
> 7267 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:33.95 java
> 7268 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:32.26 java
> 7269 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:08.90 java
>
>
>
> Stacktrace for PID 7402:
>
> RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=16020
>
> priority:5 - threadId:0x00007f0ec5ef7000 - nativeId:0x1cea - nativeId (decimal):7402 - state:RUNNABLE
> stackTrace:
> java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.hbase.CellComparator.compareColumns(CellComparator.java:158)
> at org.apache.hadoop.hbase.CellComparator.compareWithoutRow(CellComparator.java:224)
> at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:66)
> at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
> at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
> at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
> at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
> at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
> at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
> at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
> at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
> at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
> at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
> - locked <0x00000006964deaf0> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
> at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> Locked ownable synchronizers:
> - None
>
>
>
> Stacktrace for PID 7399:
>
> priority:5 - threadId:0x00007f0ec5ef1000 - nativeId:0x1ce7 - nativeId (decimal):7399 - state:RUNNABLE
> stackTrace:
> java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.hbase.CellComparator.compareRows(CellComparator.java:187)
> at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:63)
> at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
> at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
> at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
> at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
> at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
> at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
> at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
> at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
> at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
> at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
> - locked <0x00000006ae08a770> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
> at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> Locked ownable synchronizers:
> - None
>
>
>
> Does anyone know what is going on? Is there possibly an infinite loop or
> some other mechanism causing these threads to get stuck?
>
>
>
> Thanks,
>
> Srinidhi
>
