[ https://issues.apache.org/jira/browse/PHOENIX-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285859#comment-15285859 ]
Enis Soztutar commented on PHOENIX-2892:
----------------------------------------

bq. The initial scan makes the overall operation faster, not slower. If it doesn't anymore, we should remove it.

How did you guys initially test it? I can try to replicate the setup with and without the patch if it is documented somewhere. I can understand why a skip scan can be faster than a batch of gets. But in this case we are doing a skip scan before a batch of gets, not instead of one. So I think it should be different from the test done in the thread http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/34697.

> Scan for pre-warming the block cache for 2ndary index should be removed
> -----------------------------------------------------------------------
>
>                 Key: PHOENIX-2892
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2892
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 4.8.0
>
>         Attachments: phoenix-2892_v1.patch
>
>
> We have run into an issue in a mid-sized cluster with secondary indexes. The
> problem is that all handlers for doing writes were blocked waiting on a
> single scan from the secondary index to complete for > 5 mins, thus causing
> all incoming RPCs to time out and causing write unavailability and further
> problems (disabling the index, etc.). We have taken jstack outputs continuously
> from the servers to understand what is going on.
> In the jstack outputs from that particular server, we can see three types of
> stacks (this is raw jstack, so unfortunately the thread names are not there).
> - First, there are a lot of threads waiting for the MVCC transactions
> started previously:
> {code}
> Thread 15292: (state = BLOCKED)
>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
>  - org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.waitForPreviousTransactionsComplete(org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl$WriteEntry) @bci=86, line=253 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.completeMemstoreInsertWithSeqNum(org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl$WriteEntry, org.apache.hadoop.hbase.regionserver.SequenceId) @bci=29, line=135 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=1906, line=3187 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=79, line=2819 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.client.Mutation[], long, long) @bci=12, line=2761 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, java.util.List, org.apache.hadoop.hbase.CellScanner) @bci=150, line=692 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionAction, org.apache.hadoop.hbase.CellScanner, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, java.util.List, long) @bci=547, line=654 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(com.google.protobuf.RpcController, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest) @bci=407, line=2032 (Compiled frame)
>  - org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.RpcController, com.google.protobuf.Message) @bci=167, line=32213 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcServer.call(com.google.protobuf.BlockingService, com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.Message, org.apache.hadoop.hbase.CellScanner, long, org.apache.hadoop.hbase.monitoring.MonitoredRPCHandler) @bci=59, line=2114 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.CallRunner.run() @bci=345, line=101 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(java.util.concurrent.BlockingQueue) @bci=54, line=130 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor$1.run() @bci=20, line=107 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> {code}
> The way MVCC works is that it assumes transactions are short-lived, and it
> guarantees that transactions are committed in strict serial order.
> Transactions in this case are write requests coming in and being executed
> by handlers. Each handler starts a transaction, gets an mvcc write index
> (which is the mvcc trx number) and does the WAL append + memstore append.
> Then it marks the mvcc trx as complete, and before returning to the user,
> we have to guarantee that the transaction is visible. So we wait for the
> mvcc read point to advance beyond our own write trx number. This is done at
> the above stack trace (waitForPreviousTransactionsComplete).
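As a side note, the serial-advance behavior described above can be modeled in a few lines of plain Java. This is a simplified illustration only, not HBase's actual MultiVersionConsistencyControl: the read point may only move past a prefix of completed writes, so a single stalled writer pins visibility for every write behind it.

```java
import java.util.TreeMap;

// Simplified model of MVCC serial read-point advance (illustration only,
// not HBase's MultiVersionConsistencyControl). Each write gets an increasing
// write number; the read point can only advance past a leading run of
// *completed* writes, so one stalled writer blocks everyone after it.
class MvccSketch {
    private final TreeMap<Long, Boolean> openWrites = new TreeMap<>(); // write number -> completed?
    private long nextWriteNumber = 1;
    private long readPoint = 0;

    /** Begin a write transaction; returns its mvcc write number. */
    synchronized long begin() {
        long w = nextWriteNumber++;
        openWrites.put(w, false);
        return w;
    }

    /** Mark a write complete and advance the read point as far as possible. */
    synchronized void complete(long w) {
        openWrites.put(w, true);
        // Pop leading completed writes; stop at the first still-open one
        // (e.g. the handler stuck in the 5-minute index scan).
        while (!openWrites.isEmpty() && openWrites.firstEntry().getValue()) {
            readPoint = openWrites.pollFirstEntry().getKey();
        }
    }

    /** Highest write number guaranteed visible to readers. */
    synchronized long readPoint() {
        return readPoint;
    }
}
```

In this model, completing write 2 while write 1 is still open leaves the read point unchanged, which is exactly the situation the blocked handlers above are waiting out.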
> A lot of threads with this stack means that one or more handlers have
> started an mvcc transaction but did not finish the work and thus did not
> complete their transactions. The MVCC read point can only advance serially,
> otherwise still-ongoing transactions would become visible, so we just wait
> for the previously started transactions to complete.
> - Second, a set of threads are waiting with this stack trace:
> {code}
> Thread 15274: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Interpreted frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(int, long) @bci=122, line=1033 (Interpreted frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(int, long) @bci=25, line=1326 (Interpreted frame)
>  - java.util.concurrent.CountDownLatch.await(long, java.util.concurrent.TimeUnit) @bci=10, line=282 (Interpreted frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(byte[], boolean) @bci=94, line=5044 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=386, line=2962 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=79, line=2819 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.client.Mutation[], long, long) @bci=12, line=2761 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, java.util.List, org.apache.hadoop.hbase.CellScanner) @bci=150, line=692 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionAction, org.apache.hadoop.hbase.CellScanner, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, java.util.List, long) @bci=547, line=654 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(com.google.protobuf.RpcController, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest) @bci=407, line=2032 (Compiled frame)
>  - org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.RpcController, com.google.protobuf.Message) @bci=167, line=32213 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcServer.call(com.google.protobuf.BlockingService, com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.Message, org.apache.hadoop.hbase.CellScanner, long, org.apache.hadoop.hbase.monitoring.MonitoredRPCHandler) @bci=59, line=2114 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.CallRunner.run() @bci=345, line=101 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(java.util.concurrent.BlockingQueue) @bci=54, line=130 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor$1.run() @bci=20, line=107 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> {code}
> These are the threads that want to write something, but they have to wait
> for the row lock(s) held by the other threads. It is very likely that the
> threads in the previous set are holding the actual row locks that this set
> of threads wants to acquire. This is normal behavior, nothing concerning.
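The latch-based wait visible in this second stack (CountDownLatch.await under getRowLockInternal) can be modeled with a simplified sketch. This is an illustration only, not HBase's actual row-lock implementation: the current holder of a row owns a latch, other writers park on that latch with a timeout, and a timeout is what eventually surfaces as a row-lock exception.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified model of a latch-based row lock (illustration only, not
// HBase's getRowLockInternal): the holder of a row owns a CountDownLatch,
// and other writers park on that latch with a timeout.
class RowLockSketch {
    private final ConcurrentHashMap<String, CountDownLatch> holders = new ConcurrentHashMap<>();

    /** Try to acquire the lock for rowKey, waiting up to timeoutMs on the current holder. */
    boolean tryLock(String rowKey, long timeoutMs) {
        while (true) {
            CountDownLatch mine = new CountDownLatch(1);
            CountDownLatch existing = holders.putIfAbsent(rowKey, mine);
            if (existing == null) {
                return true; // acquired: we are now the holder
            }
            try {
                // Park on the current holder's latch, like the threads above
                // blocked in CountDownLatch.await.
                if (!existing.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                    return false; // timed out: the write fails and the RPC is retried
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            // Holder released; loop and race to become the new holder.
        }
    }

    /** Release the lock and wake all waiters. */
    void unlock(String rowKey) {
        CountDownLatch latch = holders.remove(rowKey);
        if (latch != null) {
            latch.countDown();
        }
    }
}
```

The point of the sketch is that waiters make no progress at all until the holder calls unlock, which is why one handler stuck in a 5-minute scan while holding row locks backs up every other write to those rows.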
> However, if a handler cannot acquire the row lock within 60 seconds, it
> will throw an exception, and the RPC will be retried.
> - Finally, the only thread in the third set is doing a scan for the index
> update:
> {code}
> Thread 15276: (state = IN_JAVA)
>  - org.apache.hadoop.hbase.regionserver.DefaultMemStore$MemStoreScanner.peek() @bci=0, line=847 (Compiled frame; information may be imprecise)
>  - org.apache.hadoop.hbase.regionserver.KeyValueHeap$KVScannerComparator.compare(org.apache.hadoop.hbase.regionserver.KeyValueScanner, org.apache.hadoop.hbase.regionserver.KeyValueScanner) @bci=2, line=181 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.KeyValueHeap$KVScannerComparator.compare(java.lang.Object, java.lang.Object) @bci=9, line=171 (Compiled frame)
>  - java.util.PriorityQueue.siftUpUsingComparator(int, java.lang.Object) @bci=25, line=649 (Compiled frame)
>  - java.util.PriorityQueue.siftUp(int, java.lang.Object) @bci=10, line=627 (Compiled frame)
>  - java.util.PriorityQueue.offer(java.lang.Object) @bci=67, line=329 (Compiled frame)
>  - java.util.PriorityQueue.add(java.lang.Object) @bci=2, line=306 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(boolean, org.apache.hadoop.hbase.Cell, boolean, boolean) @bci=36, line=289 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(org.apache.hadoop.hbase.Cell) @bci=5, line=256 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(org.apache.hadoop.hbase.Cell) @bci=53, line=817 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(org.apache.hadoop.hbase.Cell) @bci=2, line=803 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.StoreScanner.next(java.util.List, org.apache.hadoop.hbase.regionserver.ScannerContext) @bci=876, line=636 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(java.util.List, org.apache.hadoop.hbase.regionserver.ScannerContext) @bci=29, line=147 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(java.util.List, org.apache.hadoop.hbase.regionserver.KeyValueHeap, org.apache.hadoop.hbase.regionserver.ScannerContext, byte[], int, short) @bci=22, line=5483 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(java.util.List, org.apache.hadoop.hbase.regionserver.ScannerContext) @bci=385, line=5634 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(java.util.List, org.apache.hadoop.hbase.regionserver.ScannerContext) @bci=29, line=5421 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(java.util.List) @bci=6, line=5407 (Compiled frame)
>  - org.apache.phoenix.index.PhoenixIndexBuilder.batchStarted(org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress) @bci=279, line=88 (Compiled frame)
>  - org.apache.phoenix.hbase.index.builder.IndexBuildManager.getIndexUpdate(org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress, java.util.Collection) @bci=5, line=121 (Compiled frame)
>  - org.apache.phoenix.hbase.index.Indexer.preBatchMutateWithExceptions(org.apache.hadoop.hbase.coprocessor.ObserverContext, org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress) @bci=300, line=278 (Compiled frame)
>  - org.apache.phoenix.hbase.index.Indexer.preBatchMutate(org.apache.hadoop.hbase.coprocessor.ObserverContext, org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress) @bci=17, line=207 (Interpreted frame)
>  - org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$35.call(org.apache.hadoop.hbase.coprocessor.RegionObserver, org.apache.hadoop.hbase.coprocessor.ObserverContext) @bci=6, line=991 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(org.apache.hadoop.hbase.Coprocessor, org.apache.hadoop.hbase.coprocessor.ObserverContext) @bci=6, line=1673 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(boolean, org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$CoprocessorOperation) @bci=88, line=1748 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$CoprocessorOperation) @bci=3, line=1705 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.preBatchMutate(org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress) @bci=26, line=987 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=996, line=3044 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=79, line=2819 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.client.Mutation[], long, long) @bci=12, line=2761 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, java.util.List, org.apache.hadoop.hbase.CellScanner) @bci=150, line=692 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionAction, org.apache.hadoop.hbase.CellScanner, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, java.util.List, long) @bci=547, line=654 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(com.google.protobuf.RpcController, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest) @bci=407, line=2032 (Compiled frame)
>  - org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.RpcController, com.google.protobuf.Message) @bci=167, line=32213 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcServer.call(com.google.protobuf.BlockingService, com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.Message, org.apache.hadoop.hbase.CellScanner, long, org.apache.hadoop.hbase.monitoring.MonitoredRPCHandler) @bci=59, line=2114 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.CallRunner.run() @bci=345, line=101 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(java.util.concurrent.BlockingQueue) @bci=54, line=130 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor$1.run() @bci=20, line=107 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> {code}
> This thread is doing a "scan" to be able to compute the index updates (note
> HRegion.doMiniBatchMutation -> IndexBuildManager.getIndexUpdate -> Region
> Scan).
> Normally, this scan should finish in a short interval, but for whatever
> reason it takes more than 5 minutes. We know that it takes more than 5
> minutes because in the jstacks from epoch 1462985796 to epoch 1462985958 we
> see the exact same thread (Thread 15276) continuing the scan.
> So, we do a skip scan which takes >5 minutes while holding some row locks
> and holding the MVCC write transaction open.
> All the other handlers that start write transactions after this one also
> hold some other row locks and wait for the handler doing the scan to finish.
> The other handlers are just waiting for some of the same row locks. So in
> effect everybody is waiting on one handler thread to finish the scan.
> So, why are we doing a scan in the index builder code, and why is it taking
> more than 5 minutes?
> Apparently, we are doing a skip scan in PhoenixIndexBuilder:
> {code}
>     public void batchStarted(MiniBatchOperationInProgress<Mutation> miniBatchOp, IndexMetaData context) throws IOException {
>         // The entire purpose of this method impl is to get the existing rows for the
>         // table rows being indexed into the block cache, as the index maintenance code
>         // does a point scan per row.
>         List<IndexMaintainer> indexMaintainers = ((PhoenixIndexMetaData)context).getIndexMaintainers();
>         List<KeyRange> keys = Lists.newArrayListWithExpectedSize(miniBatchOp.size());
>         Map<ImmutableBytesWritable, IndexMaintainer> maintainers =
>                 new HashMap<ImmutableBytesWritable, IndexMaintainer>();
>         ImmutableBytesWritable indexTableName = new ImmutableBytesWritable();
>         for (int i = 0; i < miniBatchOp.size(); i++) {
>             Mutation m = miniBatchOp.getOperation(i);
>             keys.add(PVarbinary.INSTANCE.getKeyRange(m.getRow()));
>
>             for (IndexMaintainer indexMaintainer : indexMaintainers) {
>                 if (indexMaintainer.isImmutableRows()) continue;
>                 indexTableName.set(indexMaintainer.getIndexTableName());
>                 if (maintainers.get(indexTableName) != null) continue;
>                 maintainers.put(indexTableName, indexMaintainer);
>             }
>         }
>         if (maintainers.isEmpty()) return;
>         Scan scan = IndexManagementUtil.newLocalStateScan(new ArrayList<IndexMaintainer>(maintainers.values()));
>         ScanRanges scanRanges = ScanRanges.createPointLookup(keys);
>         scanRanges.initializeScan(scan);
>         scan.setFilter(new SkipScanFilter(scanRanges.getSkipScanFilter(), true));
>         Region region = env.getRegion();
>         RegionScanner scanner = region.getScanner(scan);
>         // Run through the scanner using internal nextRaw method
>         region.startRegionOperation();
>         try {
>             synchronized (scanner) {
>                 boolean hasMore;
>                 do {
>                     List<Cell> results = Lists.newArrayList();
>                     // Results are potentially returned even when the return value of s.next is
>                     // false since this is an indication of whether or not there are more values
>                     // after the ones returned
>                     hasMore = scanner.nextRaw(results);
>                 } while (hasMore);
>             }
>         }
> {code}
> I think that for some cases this skip scan turns into an expensive long
> scan (instead of a point-lookup scan). However, there is no easy way to
> debug the actual rows and the schema from that.
> The comments say that the scan is done to bring the results into the block
> cache for the gets that follow. However, there should not be a need for
> that at all, since the reads themselves will bring those results into the
> block cache. The only possible justification would be prefetching via the
> skip scan, but it is still not correct to do double work.
> The other thing is that we execute the getIndexUpdate() calls in parallel
> later, but we are still processing the skip scan serially.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)