Tanuj Khurana created PHOENIX-7611:
--------------------------------------
Summary: Memory corruption issue in Phoenix coprocessors with HBase 2
Key: PHOENIX-7611
URL: https://issues.apache.org/jira/browse/PHOENIX-7611
Project: Phoenix
Issue Type: Bug
Affects Versions: 5.2.1, 5.1.3, 5.1.2, 5.2.0, 5.1.1, 5.1.0, 5.0.0
Reporter: Tanuj Khurana
The memory corruption has surfaced as segmentation faults that crash the
RegionServer. We have observed this in production in our environment as well
as in ITs, and PHOENIX-7419 is already open for it. I was also hitting this
issue while working on PHOENIX-7591. There, the test would sometimes fail with
a FATAL error message of SIGSEGV, but more often it would fail with silent
corruption. After adding more logging, I found that some of the Cell
references we were storing in IndexRegionObserver were getting corrupted.
I started looking around in HBase for similar corruptions and found that, from
HBase 2 onwards, the contract with the coprocessor for the preBatchMutate hook
says:
*Do not retain references to any Cells in Mutations* beyond the life of this
invocation. If need a Cell reference for later use, copy the cell and use that.
IndexRegionObserver maintains the row state in memory as a Put mutation in
order to handle concurrent updates. That Put holds references to the Cells in
the Mutation, and the lifetime of these references exceeds the invocation of
the hook. It seems that in some cases these Cells can be backed by off-heap
memory, which can be reclaimed or reused, causing corruption.
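The failure mode described above can be illustrated with a toy model that uses only the JDK. This is a sketch, not HBase or Phoenix code: BufferView, fill and run are hypothetical names, and a plain direct ByteBuffer stands in for the off-heap memory that the server reuses. In this model the retained view only reads stale bytes; in the real server the memory may also have been released, which is consistent with the SIGSEGV.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Toy model (not HBase code) of a reference retained into a reused buffer. */
final class RetainedReferenceDemo {

    /** Hypothetical stand-in for a ByteBuffer-backed Cell: a view, not a copy. */
    static final class BufferView {
        private final ByteBuffer buffer;
        private final int offset;
        private final int length;

        BufferView(ByteBuffer buffer, int offset, int length) {
            this.buffer = buffer;
            this.offset = offset;
            this.length = length;
        }

        /** Copies the viewed bytes onto the heap. */
        byte[] toBytes() {
            byte[] out = new byte[length];
            ByteBuffer dup = buffer.duplicate(); // independent position, shared content
            dup.position(offset);
            dup.get(out, 0, length);
            return out;
        }

        String read() {
            return new String(toBytes(), StandardCharsets.US_ASCII);
        }

        /** Returns a view over a private heap copy, safe to keep after the hook returns. */
        BufferView deepCopy() {
            return new BufferView(ByteBuffer.wrap(toBytes()), 0, length);
        }
    }

    /** Simulates the server writing the next request into the same buffer. */
    static void fill(ByteBuffer buffer, String payload) {
        buffer.clear();
        buffer.put(payload.getBytes(StandardCharsets.US_ASCII));
    }

    /** Returns {value seen through the retained view, value seen through the copy}. */
    static String[] run() {
        ByteBuffer rpcBuffer = ByteBuffer.allocateDirect(16);  // off-heap
        fill(rpcBuffer, "value-1");
        BufferView retained = new BufferView(rpcBuffer, 0, 7); // kept past the "hook"
        BufferView copied = retained.deepCopy();               // copied inside the "hook"
        fill(rpcBuffer, "value-2");                            // buffer is reused
        return new String[] { retained.read(), copied.read() };
    }

    public static void main(String[] args) {
        String[] seen = run();
        System.out.println("retained view: " + seen[0]); // prints "retained view: value-2"
        System.out.println("deep copy:     " + seen[1]); // prints "deep copy:     value-1"
    }
}
```

The retained view silently changes to the later request's bytes, while the copy taken during the invocation keeps the original value, which is the behaviour the HBase contract asks coprocessors to rely on.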
This also lines up with the stack trace attached to PHOENIX-7419
([^hs_err_pid783375.log]):
{code:java}
v ~StubRoutines::jbyte_disjoint_arraycopy
J 23481 C2
org.apache.hadoop.hbase.unsafe.HBasePlatformDependent.copyMemory(Ljava/lang/Object;JLjava/lang/Object;JJ)V
(22 bytes) @ 0x00007fb765360c32 [0x00007fb765360be0+0x52]
j
org.apache.hadoop.hbase.util.UnsafeAccess.unsafeCopy(Ljava/lang/Object;JLjava/lang/Object;JJ)V+56
j
org.apache.hadoop.hbase.util.UnsafeAccess.copy(Ljava/nio/ByteBuffer;I[BII)V+105
j
org.apache.hadoop.hbase.util.ByteBufferUtils.copyFromBufferToArray([BLjava/nio/ByteBuffer;III)V+65
j
org.apache.hadoop.hbase.CellUtil.copyQualifierTo(Lorg/apache/hadoop/hbase/Cell;[BI)I+56
J 24630 C2
org.apache.phoenix.coprocessor.GlobalIndexRegionScanner.apply(Lorg/apache/hadoop/hbase/client/Put;Lorg/apache/hadoop/hbase/client/Put;)V
(167 bytes) @ 0x00007fb7656262e0 [0x00007fb765625ca0+0x640]
J 24258 C1
org.apache.phoenix.hbase.index.IndexRegionObserver.applyPendingPutMutations(Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;Lorg/apache/phoenix/hbase/index/IndexRegionObserver$BatchMutateContext;J)V
(430 bytes) @ 0x00007fb7654ac234 [0x00007fb7654aa880+0x19b4]
j
org.apache.phoenix.hbase.index.IndexRegionObserver.prepareDataRowStates(Lorg/apache/hadoop/hbase/coprocessor/ObserverContext;Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;Lorg/apache/phoenix/hbase/index/IndexRegionObserver$BatchMutateContext;J)V+30
J 25543 C1
org.apache.phoenix.hbase.index.IndexRegionObserver.preBatchMutateWithExceptions(Lorg/apache/hadoop/hbase/coprocessor/ObserverContext;Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;)V
(1004 bytes) @ 0x00007fb764ef1ffc [0x00007fb764eef7c0+0x283c]
J 25542 C1
org.apache.phoenix.hbase.index.IndexRegionObserver.preBatchMutate(Lorg/apache/hadoop/hbase/coprocessor/ObserverContext;Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;)V
(76 bytes) @ 0x00007fb762d1dc24 [0x00007fb762d1db00+0x124]
J 22752 C1
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$28.call(Ljava/lang/Object;)V
(17 bytes) @ 0x00007fb7629b21d4 [0x00007fb7629b1f00+0x2d4]
J 14450 C2
org.apache.hadoop.hbase.coprocessor.CoprocessorHost$ObserverOperationWithoutResult.callObserver()V
(70 bytes) @ 0x00007fb762483240 [0x00007fb7624830c0+0x180]
J 18110 C2
org.apache.hadoop.hbase.coprocessor.CoprocessorHost.execOperation(Lorg/apache/hadoop/hbase/coprocessor/CoprocessorHost$ObserverOperation;)Z
(274 bytes) @ 0x00007fb76463c74c [0x00007fb76463c320+0x42c]
J 23033 C1
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.preBatchMutate(Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;)V
(42 bytes) @ 0x00007fb762b39dcc [0x00007fb762b39640+0x78c]
J 14181 C1
org.apache.hadoop.hbase.regionserver.HRegion$MutationBatchOperation.prepareMiniBatchOperations(Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;JLjava/util/List;)V
(105 bytes) @ 0x00007fb763a21b3c [0x00007fb763a21380+0x7bc]
J 14199 C1
org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(Lorg/apache/hadoop/hbase/regionserver/HRegion$BatchOperation;)V
(970 bytes) @ 0x00007fb763a37a94 [0x00007fb763a36f20+0xb74]
J 13124 C1
org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(Lorg/apache/hadoop/hbase/regionserver/HRegion$BatchOperation;)[Lorg/apache/hadoop/hbase/regionserver/OperationStatus;
(354 bytes) @ 0x00007fb7636a5a64 [0x00007fb7636a5320+0x744] {code}
This restriction actually applies to all the methods in the RegionObserver
interface; the contract was updated in HBASE-15735, introduced in HBase 2.
Phoenix has several coprocessors that implement the RegionObserver interface.
We need to investigate all such implementations and fix them if they hold on
to Cell references after the invocation of the hook API.
Two patterns I have seen are:
1. We store the reference to the Cell directly, or in a collection like
List<Cell>.
2. We store it indirectly, for example inside a Mutation object.
It seems this is only a problem if we store references to Cells that extend
ByteBufferKeyValue, which extends ByteBufferExtendedCell, since those can be
backed by off-heap memory.
KeyValue instances seem fine (the ones returned by
[GenericKeyValueBuilder.java|https://github.com/apache/phoenix/blob/master/phoenix-core-client/src/main/java/org/apache/phoenix/hbase/index/util/GenericKeyValueBuilder.java]).
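One possible shape for a fix, given the two storage patterns and the heap versus off-heap distinction above, is to copy only the buffer-backed cells at the point where they are stored. The sketch below models that idea with JDK types only. SimpleCell, HeapCell, BufferCell and copyForRetention are hypothetical names invented for this illustration and are not HBase or Phoenix APIs; the real change would operate on HBase Cell and Mutation objects.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Toy model (not HBase code): copy buffer-backed cells before retaining them. */
final class CellRetentionSketch {

    /** Hypothetical minimal cell abstraction. */
    interface SimpleCell {
        byte[] valueCopy();       // always returns a fresh heap array
        boolean isBufferBacked(); // true if the bytes live in a shared ByteBuffer
    }

    /** Owns its bytes on the heap, analogous to a plain KeyValue. */
    static final class HeapCell implements SimpleCell {
        private final byte[] value;
        HeapCell(byte[] value) { this.value = value; }
        @Override public byte[] valueCopy() { return value.clone(); }
        @Override public boolean isBufferBacked() { return false; }
    }

    /** A view into a buffer it does not own, analogous to a ByteBuffer-backed cell. */
    static final class BufferCell implements SimpleCell {
        private final ByteBuffer buffer;
        private final int offset;
        private final int length;
        BufferCell(ByteBuffer buffer, int offset, int length) {
            this.buffer = buffer;
            this.offset = offset;
            this.length = length;
        }
        @Override public byte[] valueCopy() {
            byte[] out = new byte[length];
            ByteBuffer dup = buffer.duplicate(); // do not disturb the shared position
            dup.position(offset);
            dup.get(out, 0, length);
            return out;
        }
        @Override public boolean isBufferBacked() { return true; }
    }

    /** Pattern 1: a single cell that will be stored beyond the hook invocation. */
    static SimpleCell copyForRetention(SimpleCell cell) {
        return cell.isBufferBacked() ? new HeapCell(cell.valueCopy()) : cell;
    }

    /** Pattern 2: cells held indirectly by a container, as a Mutation holds its Cells. */
    static List<SimpleCell> copyForRetention(List<SimpleCell> cells) {
        List<SimpleCell> retained = new ArrayList<>(cells.size());
        for (SimpleCell cell : cells) {
            retained.add(copyForRetention(cell));
        }
        return retained;
    }
}
```

Copying only the buffer-backed cells avoids extra allocation for cells that already own heap memory; copying unconditionally would be simpler and may be the safer default if the cell type cannot be determined reliably.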