[
https://issues.apache.org/jira/browse/PHOENIX-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tanuj Khurana updated PHOENIX-7611:
-----------------------------------
Description:
The memory corruption has surfaced as segmentation faults that crash the
RegionServer. We have observed this in production in our environment as well
as in ITs. We already have PHOENIX-7419 open for it. I was also hitting this
issue when working on PHOENIX-7591. There, the test would sometimes fail with
a FATAL SIGSEGV error message, but more often it would fail with silent
corruption. After adding more logging, I found that
some of the Cell references we were storing in IndexRegionObserver were
getting corrupted.
I started looking around in HBase for similar corruptions and found that, from
HBase 2 onwards, the coprocessor contract for the preBatchMutate hook says:
*Do not retain references to any Cells in Mutations* beyond the life of this
invocation. If need a Cell reference for later use, copy the cell and use that.
IndexRegionObserver keeps the row state in memory as a Put mutation in order to
handle concurrent updates. That Put references the Cells of the incoming
Mutation, and those references outlive the invocation of the hook. It seems
that in some cases these Cells are backed by off-heap memory, which can be
reclaimed or reused, causing the corruption.
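The failure mode described above can be imitated with plain JDK classes. The sketch below is only an analogy and does not use any HBase or Phoenix API: BufferBackedValue and fill are hypothetical names, and a direct ByteBuffer stands in for the pooled off-heap memory. Because the JVM never frees the buffer here, the demo shows only the silent-corruption variant, not a SIGSEGV.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class StaleCellDemo {

    /** Hypothetical stand-in for a buffer-backed Cell: a view into shared memory. */
    static final class BufferBackedValue {
        private final ByteBuffer buf;
        private final int offset;
        private final int length;

        BufferBackedValue(ByteBuffer buf, int offset, int length) {
            this.buf = buf;
            this.offset = offset;
            this.length = length;
        }

        /** Reads whatever bytes the view points at right now. */
        String read() {
            byte[] out = new byte[length];
            for (int i = 0; i < length; i++) {
                out[i] = buf.get(offset + i); // absolute read, no position change
            }
            return new String(out, StandardCharsets.UTF_8);
        }

        /** Deep copy onto the heap, so the result no longer shares memory. */
        BufferBackedValue copy() {
            byte[] out = new byte[length];
            for (int i = 0; i < length; i++) {
                out[i] = buf.get(offset + i);
            }
            return new BufferBackedValue(ByteBuffer.wrap(out), 0, length);
        }
    }

    /** Writes a value at the start of the buffer, imitating reuse by a later request. */
    static void fill(ByteBuffer buf, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < bytes.length; i++) {
            buf.put(i, bytes[i]);
        }
    }

    /** Returns { value seen through the retained view, value seen through the copy }. */
    static String[] run() {
        ByteBuffer pooled = ByteBuffer.allocateDirect(16); // assumed pooled off-heap memory
        fill(pooled, "row1-v1");

        BufferBackedValue retained = new BufferBackedValue(pooled, 0, 7); // kept past the "hook"
        BufferBackedValue copied = retained.copy();                       // copied inside the "hook"

        fill(pooled, "row9-v7"); // the buffer is reused after the "hook" returns

        return new String[] { retained.read(), copied.read() };
    }

    public static void main(String[] args) {
        String[] result = run();
        System.out.println("retained view: " + result[0]);
        System.out.println("copied value:  " + result[1]);
    }
}
```

In this sketch the retained view reads the bytes of the later request, while the copy still holds the original value, which is the behaviour the HBase note asks coprocessors to rely on.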
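Retained references to reclaimed or reused off-heap memory would also explain
the stack trace attached to PHOENIX-7419 (hs_err_pid783375.log), where the
crash happens while copying a Cell qualifier out of a ByteBuffer: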
{code:java}
v ~StubRoutines::jbyte_disjoint_arraycopy
J 23481 C2 org.apache.hadoop.hbase.unsafe.HBasePlatformDependent.copyMemory(Ljava/lang/Object;JLjava/lang/Object;JJ)V (22 bytes) @ 0x00007fb765360c32 [0x00007fb765360be0+0x52]
j org.apache.hadoop.hbase.util.UnsafeAccess.unsafeCopy(Ljava/lang/Object;JLjava/lang/Object;JJ)V+56
j org.apache.hadoop.hbase.util.UnsafeAccess.copy(Ljava/nio/ByteBuffer;I[BII)V+105
j org.apache.hadoop.hbase.util.ByteBufferUtils.copyFromBufferToArray([BLjava/nio/ByteBuffer;III)V+65
j org.apache.hadoop.hbase.CellUtil.copyQualifierTo(Lorg/apache/hadoop/hbase/Cell;[BI)I+56
J 24630 C2 org.apache.phoenix.coprocessor.GlobalIndexRegionScanner.apply(Lorg/apache/hadoop/hbase/client/Put;Lorg/apache/hadoop/hbase/client/Put;)V (167 bytes) @ 0x00007fb7656262e0 [0x00007fb765625ca0+0x640]
J 24258 C1 org.apache.phoenix.hbase.index.IndexRegionObserver.applyPendingPutMutations(Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;Lorg/apache/phoenix/hbase/index/IndexRegionObserver$BatchMutateContext;J)V (430 bytes) @ 0x00007fb7654ac234 [0x00007fb7654aa880+0x19b4]
j org.apache.phoenix.hbase.index.IndexRegionObserver.prepareDataRowStates(Lorg/apache/hadoop/hbase/coprocessor/ObserverContext;Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;Lorg/apache/phoenix/hbase/index/IndexRegionObserver$BatchMutateContext;J)V+30
J 25543 C1 org.apache.phoenix.hbase.index.IndexRegionObserver.preBatchMutateWithExceptions(Lorg/apache/hadoop/hbase/coprocessor/ObserverContext;Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;)V (1004 bytes) @ 0x00007fb764ef1ffc [0x00007fb764eef7c0+0x283c]
J 25542 C1 org.apache.phoenix.hbase.index.IndexRegionObserver.preBatchMutate(Lorg/apache/hadoop/hbase/coprocessor/ObserverContext;Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;)V (76 bytes) @ 0x00007fb762d1dc24 [0x00007fb762d1db00+0x124]
J 22752 C1 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$28.call(Ljava/lang/Object;)V (17 bytes) @ 0x00007fb7629b21d4 [0x00007fb7629b1f00+0x2d4]
J 14450 C2 org.apache.hadoop.hbase.coprocessor.CoprocessorHost$ObserverOperationWithoutResult.callObserver()V (70 bytes) @ 0x00007fb762483240 [0x00007fb7624830c0+0x180]
J 18110 C2 org.apache.hadoop.hbase.coprocessor.CoprocessorHost.execOperation(Lorg/apache/hadoop/hbase/coprocessor/CoprocessorHost$ObserverOperation;)Z (274 bytes) @ 0x00007fb76463c74c [0x00007fb76463c320+0x42c]
J 23033 C1 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.preBatchMutate(Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;)V (42 bytes) @ 0x00007fb762b39dcc [0x00007fb762b39640+0x78c]
J 14181 C1 org.apache.hadoop.hbase.regionserver.HRegion$MutationBatchOperation.prepareMiniBatchOperations(Lorg/apache/hadoop/hbase/regionserver/MiniBatchOperationInProgress;JLjava/util/List;)V (105 bytes) @ 0x00007fb763a21b3c [0x00007fb763a21380+0x7bc]
J 14199 C1 org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(Lorg/apache/hadoop/hbase/regionserver/HRegion$BatchOperation;)V (970 bytes) @ 0x00007fb763a37a94 [0x00007fb763a36f20+0xb74]
J 13124 C1 org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(Lorg/apache/hadoop/hbase/regionserver/HRegion$BatchOperation;)[Lorg/apache/hadoop/hbase/regionserver/OperationStatus; (354 bytes) @ 0x00007fb7636a5a64 [0x00007fb7636a5320+0x744]
{code}
This contract actually applies to all the methods in the RegionObserver
interface. It was updated in HBASE-15735, which was introduced in HBase 2.
Phoenix has several coprocessors that implement the RegionObserver interface.
We need to investigate all such implementations and fix any that hold on to
Cell references after the hook invocation returns.
Two patterns I have seen are:
1. We store the Cell reference directly, either on its own or in a collection
such as List<Cell>.
2. We store it indirectly, for example inside a Mutation object.
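As a rough illustration of what a fix for both patterns could look like, the sketch below copies everything before it is placed in long-lived state. It uses only JDK classes, not the HBase API: MiniMutation, copyAll and the two copyOf overloads are hypothetical names, and ByteBuffer values stand in for Cells. A real fix would copy the Cells themselves with whatever HBase utility is appropriate.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RetainByCopyDemo {

    /** Hypothetical stand-in for a Mutation: a row key plus buffer-backed values. */
    static final class MiniMutation {
        final String row;
        final List<ByteBuffer> values;

        MiniMutation(String row, List<ByteBuffer> values) {
            this.row = row;
            this.values = values;
        }
    }

    /** Copies one value onto the heap so it no longer shares memory with its source. */
    static ByteBuffer copyOf(ByteBuffer src) {
        ByteBuffer view = src.duplicate(); // own position and limit, source left untouched
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return ByteBuffer.wrap(bytes);
    }

    /** Pattern 1: a collection of references. Copy every element before storing it. */
    static List<ByteBuffer> copyAll(List<ByteBuffer> values) {
        List<ByteBuffer> out = new ArrayList<>(values.size());
        for (ByteBuffer value : values) {
            out.add(copyOf(value));
        }
        return out;
    }

    /** Pattern 2: an indirect reference. Copy the container and everything inside it. */
    static MiniMutation copyOf(MiniMutation mutation) {
        return new MiniMutation(mutation.row, copyAll(mutation.values));
    }

    /** Returns the first byte of the retained value after the source buffer was reused. */
    static byte run() {
        ByteBuffer pooled = ByteBuffer.allocateDirect(4); // assumed pooled off-heap memory
        pooled.put(0, (byte) 1);

        // State that outlives the hook invocation holds copies only.
        Map<String, MiniMutation> pendingRows = new HashMap<>();
        MiniMutation incoming = new MiniMutation("r1", Collections.singletonList(pooled));
        pendingRows.put(incoming.row, copyOf(incoming));

        pooled.put(0, (byte) 42); // the buffer is reused after the hook returns

        return pendingRows.get("r1").values.get(0).get(0);
    }

    public static void main(String[] args) {
        System.out.println("retained first byte: " + run());
    }
}
```

The copy costs an allocation per retained value, so it may be worth limiting it to state that genuinely has to outlive the hook.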
It seems this is only a problem if we store references to Cells that extend
ByteBufferKeyValue, which extends ByteBufferExtendedCell, since they can be
backed by off-heap memory.
KeyValue instances seem fine (the ones returned by
[GenericKeyValueBuilder.java|https://github.com/apache/phoenix/blob/master/phoenix-core-client/src/main/java/org/apache/phoenix/hbase/index/util/GenericKeyValueBuilder.java]).
> Memory corruption issue in Phoenix coprocessors in HBase 2
> ----------------------------------------------------------
>
> Key: PHOENIX-7611
> URL: https://issues.apache.org/jira/browse/PHOENIX-7611
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 5.0.0, 5.1.0, 5.1.1, 5.2.0, 5.1.2, 5.1.3, 5.2.1
> Reporter: Tanuj Khurana
> Priority: Major
>