Perhaps you were running scans with very heavy filters that simply take a long time. According to the thread dumps, the two scans were not stuck: both threads were RUNNABLE, so I think the scans were just making slow progress rather than looping.
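For what it's worth, FilterListWithOR.getNextCellHint showing up hot in both traces is consistent with an OR filter list issuing a seek hint on many cells: with MUST_PASS_ONE semantics the list can only skip ahead to the smallest hint among its child filters, which costs one cell comparison per child on every call. Here is a toy sketch of that selection step in plain Java -- not HBase source, and all names are made up for illustration:

```java
import java.util.List;

// Toy model (NOT HBase code) of how an OR filter list chooses a seek hint.
// With MUST_PASS_ONE semantics the list may only skip ahead to the smallest
// hint among its child filters, so each getNextCellHint call performs one
// comparison per extra child. Hinting on every cell of a wide row keeps a
// handler thread RUNNABLE and busy in compare(), as in the quoted traces.
public class OrHintDemo {

    // Returns the lexicographically smallest row-key hint; compareCount[0]
    // accumulates the number of comparator invocations (n - 1 per call).
    static String minHint(List<String> childHints, int[] compareCount) {
        String min = childHints.get(0);
        for (int i = 1; i < childHints.size(); i++) {
            compareCount[0]++;                  // one compare per extra child
            if (childHints.get(i).compareTo(min) < 0) {
                min = childHints.get(i);
            }
        }
        return min;
    }

    public static void main(String[] args) {
        int[] compares = new int[1];
        String hint = minHint(List.of("row7", "row2", "row9", "row4"), compares);
        System.out.println("seek to " + hint + " after " + compares[0] + " compares");
    }
}
```

So the per-cell cost scales with the number of filters in the list; a large OR list evaluated over wide rows can account for a lot of CPU without anything being stuck.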
On Thu, Apr 18, 2019 at 9:06 AM Srinidhi Muppalla <[email protected]> wrote:

> Hello,
>
> We recently upgraded our staging cluster running on S3-backed EMR from
> HBase 1.3.0 to HBase 1.4.9. After doing so and running a test load, we
> noticed a sudden spike in CPU utilization on one of the nodes in the
> cluster (jumping from 20% to 60% all at once). After looking at the CPU
> time of the threads and taking a thread dump of the application, we
> noticed that there were exactly two threads, each with enormous CPU
> usage, running since the start of the spike. They both appear to be
> stuck in the CellComparator compare method. We turned off all requests
> to the cluster, and the CPU usage and those threads remain in the same
> state. I've pasted some data and stack traces below and also attached
> full thread dumps. We haven't had this issue when running HBase 1.3.
>
> What top shows:
>
>   PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
>  7198 hbase 20   0 7944m 2.6g  40m S 213.0 17.9 135:39.08 java
>  4755 root  20   0 4329m 262m  23m S   0.7  1.7   0:30.52 java
>  6630 yarn  20   0 4014m 430m  36m S   0.3  2.9   0:48.25 java
>     1 root  20   0 19680 2708 2324 S   0.0  0.0   0:01.37 init
>     2 root  20   0     0    0    0 S   0.0  0.0   0:00.00 kthreadd
>     4 root   0 -20     0    0    0 I   0.0  0.0   0:00.00 kworker/0:0H
>     6 root   0 -20     0    0    0 I   0.0  0.0   0:00.00 mm_percpu_wq
>     7 root  20   0     0    0    0 S   0.0  0.0   0:01.47 ksoftirqd/0
>     8 root  20   0     0    0    0 I   0.0  0.0   0:02.74 rcu_sched
>     9 root  20   0     0    0    0 I   0.0  0.0   0:00.00 rcu_bh
>    10 root  RT   0     0    0    0 S   0.0  0.0   0:00.04 migration/0
>    11 root  RT   0     0    0    0 S   0.0  0.0   0:00.01 watchdog/0
>    12 root  20   0     0    0    0 S   0.0  0.0   0:00.00 cpuhp/0
>    13 root  20   0     0    0    0 S   0.0  0.0   0:00.00 cpuhp/1
>    14 root  RT   0     0    0    0 S   0.0  0.0   0:00.01 watchdog/1
>    15 root  RT   0     0    0    0 S   0.0  0.0   0:00.03 migration/1
>    16 root  20   0     0    0    0 S   0.0  0.0   0:00.03 ksoftirqd/1
>    18 root   0 -20     0    0    0 I   0.0  0.0   0:00.00 kworker/1:0H
>
> A breakdown of the threads:
>
> sudo top -n 1 -H -p 7198
>
> top - 23:25:50 up 1:50, 2 users, load average: 2.11, 2.28, 2.27
> Tasks: 232 total, 2 running, 230 sleeping, 0 stopped, 0 zombie
> Cpu(s): 36.5%us, 1.0%sy, 0.0%ni, 62.0%id, 0.1%wa, 0.0%hi, 0.2%si, 0.2%st
> Mem: 15394140k total, 10347008k used, 5047132k free, 60708k buffers
> Swap: 0k total, 0k used, 0k free, 5358004k cached
>
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>  7402 hbase 20   0 7944m 2.7g  40m R 92.8 18.7  65:12.87 java
>  7399 hbase 20   0 7944m 2.7g  40m R 90.8 18.7  65:17.01 java
>  7257 hbase 20   0 7944m 2.7g  40m S  7.9 18.7   4:01.49 java
>  7258 hbase 20   0 7944m 2.7g  40m S  7.9 18.7   4:01.30 java
>  7259 hbase 20   0 7944m 2.7g  40m S  7.9 18.7   4:01.46 java
>  7260 hbase 20   0 7944m 2.7g  40m S  7.9 18.7   4:01.69 java
> 10443 hbase 20   0 7944m 2.7g  40m S  2.0 18.7   0:14.43 java
>  7198 hbase 20   0 7944m 2.7g  40m S  0.0 18.7   0:00.01 java
>  7256 hbase 20   0 7944m 2.7g  40m S  0.0 18.7   0:06.54 java
>  7261 hbase 20   0 7944m 2.7g  40m S  0.0 18.7   0:13.84 java
>  7262 hbase 20   0 7944m 2.7g  40m S  0.0 18.7   0:41.45 java
>  7263 hbase 20   0 7944m 2.7g  40m S  0.0 18.7   0:00.48 java
>  7264 hbase 20   0 7944m 2.7g  40m S  0.0 18.7   0:00.41 java
>  7265 hbase 20   0 7944m 2.7g  40m S  0.0 18.7   0:00.00 java
>  7266 hbase 20   0 7944m 2.7g  40m S  0.0 18.7   0:00.00 java
>  7267 hbase 20   0 7944m 2.7g  40m S  0.0 18.7   0:33.95 java
>  7268 hbase 20   0 7944m 2.7g  40m S  0.0 18.7   0:32.26 java
>  7269 hbase 20   0 7944m 2.7g  40m S  0.0 18.7   0:08.90 java
>
> Stacktrace for PID 7402:
>
> "RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=16020"
> priority:5 - threadId:0x00007f0ec5ef7000 - nativeId:0x1cea - nativeId (decimal):7402 - state:RUNNABLE
> stackTrace:
> java.lang.Thread.State: RUNNABLE
>   at org.apache.hadoop.hbase.CellComparator.compareColumns(CellComparator.java:158)
>   at org.apache.hadoop.hbase.CellComparator.compareWithoutRow(CellComparator.java:224)
>   at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:66)
>   at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
>   at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
>   at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
>   at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
>   at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
>   at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
>   - locked <0x00000006964deaf0> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
>   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> Locked ownable synchronizers:
>   - None
>
> Stacktrace for PID 7399:
>
> priority:5 - threadId:0x00007f0ec5ef1000 - nativeId:0x1ce7 - nativeId (decimal):7399 - state:RUNNABLE
> stackTrace:
> java.lang.Thread.State: RUNNABLE
>   at org.apache.hadoop.hbase.CellComparator.compareRows(CellComparator.java:187)
>   at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:63)
>   at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
>   at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
>   at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
>   at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
>   at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
>   at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
>   - locked <0x00000006ae08a770> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
>   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> Locked ownable synchronizers:
>   - None
>
> Does anyone know what is going on? Is there possibly an infinite loop or
> some other mechanism causing these threads to get stuck?
>
> Thanks,
> Srinidhi
