[ https://issues.apache.org/jira/browse/PHOENIX-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285859#comment-15285859 ]
Enis Soztutar commented on PHOENIX-2892:
----------------------------------------

bq. The initial scan makes the overall operation faster, not slower. If it doesn't anymore, we should remove it.

How did you guys initially test it? I can try to replicate the setup with and without the patch if it is documented somewhere. I can understand why a skip scan can be faster than a batch of gets. But in this case we are doing a skip scan before a batch of gets, not instead of one. So I think it should be different from the test done in the thread http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/34697.

> Scan for pre-warming the block cache for 2ndary index should be removed
> -----------------------------------------------------------------------
>
>                 Key: PHOENIX-2892
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2892
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 4.8.0
>
>         Attachments: phoenix-2892_v1.patch
>
>
> We have run into an issue in a mid-sized cluster with secondary indexes. The
> problem is that all handlers for doing writes were blocked waiting on a
> single scan from the secondary index to complete for > 5 mins, thus causing
> all incoming RPCs to time out and causing write unavailability and further
> problems (disabling the index, etc.). We have taken jstack outputs continuously
> from the servers to understand what is going on.
> In the jstack outputs from that particular server, we can see three types of
> stacks (this is raw jstack, so unfortunately the thread names are not there).
> - First, there are a lot of threads waiting for the MVCC transactions
> started previously:
> {code}
> Thread 15292: (state = BLOCKED)
>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
>  - org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.waitForPreviousTransactionsComplete(org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl$WriteEntry) @bci=86, line=253 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.completeMemstoreInsertWithSeqNum(org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl$WriteEntry, org.apache.hadoop.hbase.regionserver.SequenceId) @bci=29, line=135 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=1906, line=3187 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=79, line=2819 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.client.Mutation[], long, long) @bci=12, line=2761 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, java.util.List, org.apache.hadoop.hbase.CellScanner) @bci=150, line=692 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionAction, org.apache.hadoop.hbase.CellScanner, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, java.util.List, long) @bci=547, line=654 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(com.google.protobuf.RpcController, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest) @bci=407, line=2032 (Compiled frame)
>  - org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.RpcController, com.google.protobuf.Message) @bci=167, line=32213 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcServer.call(com.google.protobuf.BlockingService, com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.Message, org.apache.hadoop.hbase.CellScanner, long, org.apache.hadoop.hbase.monitoring.MonitoredRPCHandler) @bci=59, line=2114 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.CallRunner.run() @bci=345, line=101 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(java.util.concurrent.BlockingQueue) @bci=54, line=130 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor$1.run() @bci=20, line=107 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> {code}
> The way MVCC works is that it assumes transactions are short-lived, and it
> guarantees that transactions are committed in strict serial order.
> Transactions in this case are write requests coming in and being executed
> by handlers. Each handler starts a transaction, gets an mvcc write index
> (which is the mvcc trx number) and does the WAL append + memstore append.
> Then it marks the mvcc trx as complete, and before returning to the user,
> we have to guarantee that the transaction is visible. So we wait for the
> mvcc read point to advance beyond our own write trx number. This is done at
> the above stack trace (waitForPreviousTransactionsComplete).
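As a side note, the serial-advance behavior described above can be modeled in a few lines of plain Java. This is a simplified illustration only, not HBase's actual MultiVersionConsistencyControl: the read point may only move past a prefix of completed writes, so a single stalled writer pins visibility for every write behind it.

```java
import java.util.TreeMap;

// Simplified model of MVCC serial read-point advance (illustration only,
// not HBase's MultiVersionConsistencyControl). Each write gets an increasing
// write number; the read point can only advance past a leading run of
// *completed* writes, so one stalled writer blocks everyone after it.
class MvccSketch {
    private final TreeMap<Long, Boolean> openWrites = new TreeMap<>(); // write number -> completed?
    private long nextWriteNumber = 1;
    private long readPoint = 0;

    /** Begin a write transaction; returns its mvcc write number. */
    synchronized long begin() {
        long w = nextWriteNumber++;
        openWrites.put(w, false);
        return w;
    }

    /** Mark a write complete and advance the read point as far as possible. */
    synchronized void complete(long w) {
        openWrites.put(w, true);
        // Pop leading completed writes; stop at the first still-open one
        // (e.g. the handler stuck in the 5-minute index scan).
        while (!openWrites.isEmpty() && openWrites.firstEntry().getValue()) {
            readPoint = openWrites.pollFirstEntry().getKey();
        }
    }

    /** Highest write number guaranteed visible to readers. */
    synchronized long readPoint() {
        return readPoint;
    }
}
```

In this model, completing write 2 while write 1 is still open leaves the read point unchanged, which is exactly the situation the blocked handlers above are waiting out.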
> A lot of threads with this stack means that one or more handlers have
> started an mvcc transaction but did not finish the work and thus did not
> complete their transactions. The MVCC read point can only advance serially,
> otherwise still-ongoing transactions would become visible, so we just wait
> for the previously started transactions to complete.
> - Second, a set of threads are waiting with this stack trace:
> {code}
> Thread 15274: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Interpreted frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(int, long) @bci=122, line=1033 (Interpreted frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(int, long) @bci=25, line=1326 (Interpreted frame)
>  - java.util.concurrent.CountDownLatch.await(long, java.util.concurrent.TimeUnit) @bci=10, line=282 (Interpreted frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(byte[], boolean) @bci=94, line=5044 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=386, line=2962 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=79, line=2819 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.client.Mutation[], long, long) @bci=12, line=2761 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, java.util.List, org.apache.hadoop.hbase.CellScanner) @bci=150, line=692 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionAction, org.apache.hadoop.hbase.CellScanner, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, java.util.List, long) @bci=547, line=654 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(com.google.protobuf.RpcController, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest) @bci=407, line=2032 (Compiled frame)
>  - org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.RpcController, com.google.protobuf.Message) @bci=167, line=32213 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcServer.call(com.google.protobuf.BlockingService, com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.Message, org.apache.hadoop.hbase.CellScanner, long, org.apache.hadoop.hbase.monitoring.MonitoredRPCHandler) @bci=59, line=2114 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.CallRunner.run() @bci=345, line=101 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(java.util.concurrent.BlockingQueue) @bci=54, line=130 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor$1.run() @bci=20, line=107 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> {code}
> These are the threads that want to write something, but they have to wait
> for the row lock(s) held by the other threads. It is very likely that the
> threads in the previous set are holding the actual row locks that this set
> of threads wants to acquire. This is normal behavior, nothing concerning.
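The latch-based wait visible in this second stack (CountDownLatch.await under getRowLockInternal) can be modeled with a simplified sketch. This is an illustration only, not HBase's actual row-lock implementation: the current holder of a row owns a latch, other writers park on that latch with a timeout, and a timeout is what eventually surfaces as a row-lock exception.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified model of a latch-based row lock (illustration only, not
// HBase's getRowLockInternal): the holder of a row owns a CountDownLatch,
// and other writers park on that latch with a timeout.
class RowLockSketch {
    private final ConcurrentHashMap<String, CountDownLatch> holders = new ConcurrentHashMap<>();

    /** Try to acquire the lock for rowKey, waiting up to timeoutMs on the current holder. */
    boolean tryLock(String rowKey, long timeoutMs) {
        while (true) {
            CountDownLatch mine = new CountDownLatch(1);
            CountDownLatch existing = holders.putIfAbsent(rowKey, mine);
            if (existing == null) {
                return true; // acquired: we are now the holder
            }
            try {
                // Park on the current holder's latch, like the threads above
                // blocked in CountDownLatch.await.
                if (!existing.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                    return false; // timed out: the write fails and the RPC is retried
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            // Holder released; loop and race to become the new holder.
        }
    }

    /** Release the lock and wake all waiters. */
    void unlock(String rowKey) {
        CountDownLatch latch = holders.remove(rowKey);
        if (latch != null) {
            latch.countDown();
        }
    }
}
```

The point of the sketch is that waiters make no progress at all until the holder calls unlock, which is why one handler stuck in a 5-minute scan while holding row locks backs up every other write to those rows.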
> However, if a handler cannot acquire the row lock within 60 seconds, it
> will throw an exception, and the RPC will be retried.
> - Finally, the only thread in the third set is doing a scan for the index
> update:
> {code}
> Thread 15276: (state = IN_JAVA)
>  - org.apache.hadoop.hbase.regionserver.DefaultMemStore$MemStoreScanner.peek() @bci=0, line=847 (Compiled frame; information may be imprecise)
>  - org.apache.hadoop.hbase.regionserver.KeyValueHeap$KVScannerComparator.compare(org.apache.hadoop.hbase.regionserver.KeyValueScanner, org.apache.hadoop.hbase.regionserver.KeyValueScanner) @bci=2, line=181 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.KeyValueHeap$KVScannerComparator.compare(java.lang.Object, java.lang.Object) @bci=9, line=171 (Compiled frame)
>  - java.util.PriorityQueue.siftUpUsingComparator(int, java.lang.Object) @bci=25, line=649 (Compiled frame)
>  - java.util.PriorityQueue.siftUp(int, java.lang.Object) @bci=10, line=627 (Compiled frame)
>  - java.util.PriorityQueue.offer(java.lang.Object) @bci=67, line=329 (Compiled frame)
>  - java.util.PriorityQueue.add(java.lang.Object) @bci=2, line=306 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(boolean, org.apache.hadoop.hbase.Cell, boolean, boolean) @bci=36, line=289 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(org.apache.hadoop.hbase.Cell) @bci=5, line=256 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(org.apache.hadoop.hbase.Cell) @bci=53, line=817 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(org.apache.hadoop.hbase.Cell) @bci=2, line=803 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.StoreScanner.next(java.util.List, org.apache.hadoop.hbase.regionserver.ScannerContext) @bci=876, line=636 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(java.util.List, org.apache.hadoop.hbase.regionserver.ScannerContext) @bci=29, line=147 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(java.util.List, org.apache.hadoop.hbase.regionserver.KeyValueHeap, org.apache.hadoop.hbase.regionserver.ScannerContext, byte[], int, short) @bci=22, line=5483 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(java.util.List, org.apache.hadoop.hbase.regionserver.ScannerContext) @bci=385, line=5634 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(java.util.List, org.apache.hadoop.hbase.regionserver.ScannerContext) @bci=29, line=5421 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(java.util.List) @bci=6, line=5407 (Compiled frame)
>  - org.apache.phoenix.index.PhoenixIndexBuilder.batchStarted(org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress) @bci=279, line=88 (Compiled frame)
>  - org.apache.phoenix.hbase.index.builder.IndexBuildManager.getIndexUpdate(org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress, java.util.Collection) @bci=5, line=121 (Compiled frame)
>  - org.apache.phoenix.hbase.index.Indexer.preBatchMutateWithExceptions(org.apache.hadoop.hbase.coprocessor.ObserverContext, org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress) @bci=300, line=278 (Compiled frame)
>  - org.apache.phoenix.hbase.index.Indexer.preBatchMutate(org.apache.hadoop.hbase.coprocessor.ObserverContext, org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress) @bci=17, line=207 (Interpreted frame)
>  - org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$35.call(org.apache.hadoop.hbase.coprocessor.RegionObserver, org.apache.hadoop.hbase.coprocessor.ObserverContext) @bci=6, line=991 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(org.apache.hadoop.hbase.Coprocessor, org.apache.hadoop.hbase.coprocessor.ObserverContext) @bci=6, line=1673 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(boolean, org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$CoprocessorOperation) @bci=88, line=1748 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$CoprocessorOperation) @bci=3, line=1705 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.preBatchMutate(org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress) @bci=26, line=987 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=996, line=3044 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.regionserver.HRegion$BatchOperationInProgress) @bci=79, line=2819 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(org.apache.hadoop.hbase.client.Mutation[], long, long) @bci=12, line=2761 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, java.util.List, org.apache.hadoop.hbase.CellScanner) @bci=150, line=692 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(org.apache.hadoop.hbase.regionserver.Region, org.apache.hadoop.hbase.quotas.OperationQuota, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionAction, org.apache.hadoop.hbase.CellScanner, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$RegionActionResult$Builder, java.util.List, long) @bci=547, line=654 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(com.google.protobuf.RpcController, org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest) @bci=407, line=2032 (Compiled frame)
>  - org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.RpcController, com.google.protobuf.Message) @bci=167, line=32213 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcServer.call(com.google.protobuf.BlockingService, com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.Message, org.apache.hadoop.hbase.CellScanner, long, org.apache.hadoop.hbase.monitoring.MonitoredRPCHandler) @bci=59, line=2114 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.CallRunner.run() @bci=345, line=101 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(java.util.concurrent.BlockingQueue) @bci=54, line=130 (Compiled frame)
>  - org.apache.hadoop.hbase.ipc.RpcExecutor$1.run() @bci=20, line=107 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> {code}
> This thread is doing a "scan" to be able to compute the index updates (note
> HRegion.doMiniBatchMutation -> IndexBuildManager.getIndexUpdate -> Region
> Scan).
> Normally, this scan should finish in a short interval, but for whatever
> reason it takes more than 5 minutes. We know that it takes more than 5
> minutes because in the jstacks from epoch 1462985796 to epoch 1462985958 we
> see the exact same thread (Thread 15276) continuing the scan.
> So, we do a skip scan which takes >5 minutes while holding some row locks
> and holding the MVCC write transaction open.
> All the other handlers that start write transactions after this one also
> hold some other row locks and wait for the handler doing the scan to finish.
> The other handlers are just waiting for some of the same row locks. So in
> effect everybody is waiting on one handler thread to finish the scan.
> So, why are we doing a scan in the index builder code, and why is it taking
> more than 5 minutes?
> Apparently, we are doing a skip scan in PhoenixIndexBuilder:
> {code}
>     public void batchStarted(MiniBatchOperationInProgress<Mutation> miniBatchOp, IndexMetaData context) throws IOException {
>         // The entire purpose of this method impl is to get the existing rows for the
>         // table rows being indexed into the block cache, as the index maintenance code
>         // does a point scan per row.
>         List<IndexMaintainer> indexMaintainers = ((PhoenixIndexMetaData)context).getIndexMaintainers();
>         List<KeyRange> keys = Lists.newArrayListWithExpectedSize(miniBatchOp.size());
>         Map<ImmutableBytesWritable, IndexMaintainer> maintainers =
>                 new HashMap<ImmutableBytesWritable, IndexMaintainer>();
>         ImmutableBytesWritable indexTableName = new ImmutableBytesWritable();
>         for (int i = 0; i < miniBatchOp.size(); i++) {
>             Mutation m = miniBatchOp.getOperation(i);
>             keys.add(PVarbinary.INSTANCE.getKeyRange(m.getRow()));
>
>             for (IndexMaintainer indexMaintainer : indexMaintainers) {
>                 if (indexMaintainer.isImmutableRows()) continue;
>                 indexTableName.set(indexMaintainer.getIndexTableName());
>                 if (maintainers.get(indexTableName) != null) continue;
>                 maintainers.put(indexTableName, indexMaintainer);
>             }
>         }
>         if (maintainers.isEmpty()) return;
>         Scan scan = IndexManagementUtil.newLocalStateScan(new ArrayList<IndexMaintainer>(maintainers.values()));
>         ScanRanges scanRanges = ScanRanges.createPointLookup(keys);
>         scanRanges.initializeScan(scan);
>         scan.setFilter(new SkipScanFilter(scanRanges.getSkipScanFilter(), true));
>         Region region = env.getRegion();
>         RegionScanner scanner = region.getScanner(scan);
>         // Run through the scanner using internal nextRaw method
>         region.startRegionOperation();
>         try {
>             synchronized (scanner) {
>                 boolean hasMore;
>                 do {
>                     List<Cell> results = Lists.newArrayList();
>                     // Results are potentially returned even when the return value of s.next is
>                     // false since this is an indication of whether or not there are more values
>                     // after the ones returned
>                     hasMore = scanner.nextRaw(results);
>                 } while (hasMore);
>             }
>         }
> {code}
> I think that for some cases this skip scan turns into an expensive long
> scan (instead of a point-lookup scan). However, there is no easy way to
> debug the actual rows and the schema from that.
> The comments say that the scan is done to bring the results into the block
> cache for the gets that follow. However, there should not be a need for
> that at all, since the reads themselves will bring those results into the
> block cache. The only possible justification would be prefetching via the
> skip scan, but it is still not correct to do double work.
> The other thing is that we execute the getIndexUpdate() calls in parallel
> later, but we are still processing the skip scan serially.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)