Tanuj Khurana created PHOENIX-7611:
--------------------------------------

             Summary: Memory corruption issue in Phoenix coprocessors af HBase 2
                 Key: PHOENIX-7611
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7611
             Project: Phoenix
          Issue Type: Bug
    Affects Versions: 5.2.1, 5.1.3, 5.1.2, 5.2.0, 5.1.1, 5.1.0, 5.0.0
            Reporter: Tanuj Khurana

The memory corruption has surfaced as segmentation faults that crash the RegionServer. We have observed this in production in our environment as well as in ITs, and PHOENIX-7419 is already open for it.

I was also hitting this issue while working on PHOENIX-7591. There, the test would sometimes fail with a FATAL error message reporting SIGSEGV, but more often it would fail with a silent corruption. After adding more logging, I found that some of the Cell references we were storing in IndexRegionObserver were getting corrupted.

I started looking around in HBase for similar corruptions and found that, from HBase 2 onwards, the contract with the coprocessor for the preBatchMutate hook says:

*Do not retain references to any Cells in Mutations* beyond the life of this invocation. If need a Cell reference for later use, copy the cell and use that.

To handle concurrent updates, IndexRegionObserver maintains the row state in memory as a Put mutation. That Put references the Cells in the incoming Mutation, and the lifetime of these references exceeds the invocation of the hook. It seems that in some cases these cells can be backed by off-heap memory, which can be reclaimed or reused, causing corruption.

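The mechanism can be illustrated in miniature without HBase. The sketch below uses only java.nio; BufferBackedCell is a hypothetical stand-in for a Cell that is a view over off-heap memory, and the pooled direct buffer stands in for request memory that the server recycles once the hook returns. It illustrates the hazard under those assumptions and is not Phoenix or HBase code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class CellRetentionSketch {

    /** Hypothetical stand-in for a Cell that is only a view over a shared buffer. */
    static final class BufferBackedCell {
        private final ByteBuffer buf;
        private final int offset;
        private final int length;

        BufferBackedCell(ByteBuffer buf, int offset, int length) {
            this.buf = buf;
            this.offset = offset;
            this.length = length;
        }

        /** Reads the bytes from the backing buffer at call time. */
        private byte[] bytes() {
            byte[] out = new byte[length];
            ByteBuffer dup = buf.duplicate(); // do not disturb the shared position
            dup.position(offset);
            dup.get(out, 0, length);
            return out;
        }

        String value() {
            return new String(bytes(), StandardCharsets.UTF_8);
        }

        /** Copies the bytes into a private on-heap array owned by the new cell. */
        BufferBackedCell deepCopy() {
            return new BufferBackedCell(ByteBuffer.wrap(bytes()), 0, length);
        }
    }

    /** Overwrites the start of the pooled buffer, as a later request would. */
    static void reuse(ByteBuffer pooled, String payload) {
        ByteBuffer dup = pooled.duplicate();
        dup.position(0);
        dup.put(payload.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        ByteBuffer pooled = ByteBuffer.allocateDirect(16);
        reuse(pooled, "row1-v1");

        BufferBackedCell retained = new BufferBackedCell(pooled, 0, 7);
        BufferBackedCell copied = retained.deepCopy();

        reuse(pooled, "row9-v9"); // the buffer is recycled for another request

        System.out.println(retained.value()); // prints "row9-v9": the retained view changed
        System.out.println(copied.value());   // prints "row1-v1": the copy is unaffected
    }
}
```

In this toy version the retained reference merely reads someone else's bytes. With real off-heap memory that has been released, the same read can also fault, which would be consistent with the SIGSEGV described above.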
This also lines up with the stack trace attached to PHOENIX-7419 ([^hs_err_pid783375.log]):

{code:java}
v ~StubRoutines::jbyte_disjoint_arraycopy
J 23481 C2 org.apache.hadoop.hbase.unsafe.HBasePlatformDependent.copyMemory(Ljava/lang/Object;JLjava/lang/Object;JJ)V (22 bytes) @ 0x00007fb765360c32 [0x00007fb765360be0+0x52]
j org.apache.hadoop.hbase.util.UnsafeAccess.unsafeCopy(Ljava/lang/Object;JLjava/lang/Object;JJ)V+56
j org.apache.hadoop.hbase.util.UnsafeAccess.copy(Ljava/nio/ByteBuffer;I[BII)V+105
j org.apache.hadoop.hbase.util.ByteBufferUtils.copyFromBufferToArray([BLjava/nio/ByteBuffer;III)V+65
j org.apache.hadoop.hbase.CellUtil.copyQualifierTo(Lorg/apache/hadoop/hbase/Cell;[BI)I+56
J 24630 C2 org.apache.phoenix.coprocessor.GlobalIndexRegionScanner.apply(Lorg/apache/hadoop/hbase/client/Put;Lorg/apache/hadoop/hbase/client/Put;)V (167 bytes) @ 0x00007fb7656262e0 [0x00007fb765625ca0+0x640]
J 24258 C1 org.apache.phoenix.hbase.index.IndexRegionObserver.applyPendingPutMutations(Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;Lorg/apache/phoenix/hbase/index/IndexRegionObserver$BatchMutateContext;J)V (430 bytes) @ 0x00007fb7654ac234 [0x00007fb7654aa880+0x19b4]
j org.apache.phoenix.hbase.index.IndexRegionObserver.prepareDataRowStates(Lorg/apache/hadoop/hbase/coprocessor/ObserverContext;Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;Lorg/apache/phoenix/hbase/index/IndexRegionObserver$BatchMutateContext;J)V+30
J 25543 C1 org.apache.phoenix.hbase.index.IndexRegionObserver.preBatchMutateWithExceptions(Lorg/apache/hadoop/hbase/coprocessor/ObserverContext;Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;)V (1004 bytes) @ 0x00007fb764ef1ffc [0x00007fb764eef7c0+0x283c]
J 25542 C1 org.apache.phoenix.hbase.index.IndexRegionObserver.preBatchMutate(Lorg/apache/hadoop/hbase/coprocessor/ObserverContext;Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;)V (76 bytes) @ 0x00007fb762d1dc24 [0x00007fb762d1db00+0x124]
J 22752 C1 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$28.call(Ljava/lang/Object;)V (17 bytes) @ 0x00007fb7629b21d4 [0x00007fb7629b1f00+0x2d4]
J 14450 C2 org.apache.hadoop.hbase.coprocessor.CoprocessorHost$ObserverOperationWithoutResult.callObserver()V (70 bytes) @ 0x00007fb762483240 [0x00007fb7624830c0+0x180]
J 18110 C2 org.apache.hadoop.hbase.coprocessor.CoprocessorHost.execOperation(Lorg/apache/hadoop/hbase/coprocessor/CoprocessorHost$ObserverOperation;)Z (274 bytes) @ 0x00007fb76463c74c [0x00007fb76463c320+0x42c]
J 23033 C1 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.preBatchMutate(Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;)V (42 bytes) @ 0x00007fb762b39dcc [0x00007fb762b39640+0x78c]
J 14181 C1 org.apache.hadoop.hbase.regionserver.HRegion$MutationBatchOperation.prepareMiniBatchOperations(Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;JLjava/util/List;)V (105 bytes) @ 0x00007fb763a21b3c [0x00007fb763a21380+0x7bc]
J 14199 C1 org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(Lorg/apache/hadoop/hbase/regionserver/HRegion$BatchOperation;)V (970 bytes) @ 0x00007fb763a37a94 [0x00007fb763a36f20+0xb74]
J 13124 C1 org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(Lorg/apache/hadoop/hbase/regionserver/HRegion$BatchOperation;)[Lorg/apache/hadoop/hbase/regionserver/OperationStatus; (354 bytes) @ 0x00007fb7636a5a64 [0x00007fb7636a5320+0x744]
{code}

This contract actually applies to all the methods in the RegionObserver interface. It was updated in HBASE-15735, which was introduced in HBase 2.

Phoenix has several coprocessors which implement the RegionObserver interface. We need to investigate all such implementations and fix them if they are holding on to cell references after the invocation of the hook API. Two patterns I have seen are:

1. We store the reference to the Cell directly, or in a collection like List<Cell>.
2. We store it indirectly, for example inside a Mutation object.

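Both patterns call for the same remedy: copy before retaining. A minimal sketch of that shape follows, again using only the JDK. RowMutation, pendingRows, and onBatch are hypothetical names chosen for illustration, and each ByteBuffer stands in for one Cell; none of this is the actual IndexRegionObserver code or the HBase API.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RetainCopySketch {

    /** Hypothetical stand-in for a Mutation: a row key plus the cells it carries. */
    static final class RowMutation {
        final String row;
        final List<ByteBuffer> cells;

        RowMutation(String row, List<ByteBuffer> cells) {
            this.row = row;
            this.cells = cells;
        }
    }

    /** Long-lived state that outlives any single hook invocation. */
    static final Map<String, RowMutation> pendingRows = new ConcurrentHashMap<>();

    /** Copies one cell's bytes into a fresh heap buffer owned by the caller. */
    static ByteBuffer copyCell(ByteBuffer cell) {
        ByteBuffer src = cell.duplicate(); // leave the source position untouched
        ByteBuffer dst = ByteBuffer.allocate(src.remaining());
        dst.put(src);
        dst.flip();
        return dst;
    }

    /** Pattern 1: cells kept directly in a collection. */
    static List<ByteBuffer> copyCells(List<ByteBuffer> cells) {
        List<ByteBuffer> copies = new ArrayList<>(cells.size());
        for (ByteBuffer cell : cells) {
            copies.add(copyCell(cell));
        }
        return copies;
    }

    /** Pattern 2: cells kept indirectly through a mutation object. */
    static RowMutation copyMutation(RowMutation m) {
        return new RowMutation(m.row, copyCells(m.cells));
    }

    /** Stand-in for a hook body: copy first, then retain. */
    static void onBatch(RowMutation incoming) {
        pendingRows.put(incoming.row, copyMutation(incoming));
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(3);
        direct.put(new byte[] {1, 2, 3});
        direct.flip();

        onBatch(new RowMutation("r1", Collections.singletonList(direct)));
        direct.put(0, (byte) 9); // the request buffer is reused after the hook returns

        System.out.println(pendingRows.get("r1").cells.get(0).get(0)); // prints 1
    }
}
```

The copy costs an allocation per retained cell, so an audit of the coprocessors would presumably apply it only where a reference actually outlives the hook.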
It seems this is only a problem if we store references to Cells which extend ByteBufferKeyValue, which in turn extends ByteBufferExtendedCell, since those can be backed by off-heap memory. KeyValue instances seem fine (the ones returned by [GenericKeyValueBuilder.java|https://github.com/apache/phoenix/blob/master/phoenix-core-client/src/main/java/org/apache/phoenix/hbase/index/util/GenericKeyValueBuilder.java]).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)