[ https://issues.apache.org/jira/browse/KUDU-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-3017:
--------------------------------
    Description: 
This bug is about misreporting the root cause of the problem: it's not easy 
to correlate the error message with the actual problem or with the phase of 
the process lifecycle at which it occurred. After analysis, it turned out to 
be just another manifestation/consequence of 
[KUDU-3016|https://issues.apache.org/jira/browse/KUDU-3016].

I saw the master crash with the following error reported in the log:

{noformat}
F1206 01:32:15.488359 1324967 tablet_replica.cc:138] Check failed: state_ == 
SHUTDOWN || state_ == FAILED TabletReplica not fully shut down. State: 
BOOTSTRAPPING
{noformat}

It's not easy to tell at what point of the master's lifecycle it happened, but 
after looking through the log and into the generated core file, it became 
clear that the problem was just a consequence of the conditions that triggered 
KUDU-3016 in the first place:

Extra info from the log:
{noformat}
I1206 01:32:15.419330 1324967 tablet_bootstrap.cc:439] T 
00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8: Bootstrap 
complete.
I1206 01:32:15.471163 1324967 raft_consensus.cc:340] T 
00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8 [term 164 
FOLLOWER]: Replica starting. Triggering 11 pending transactions. Active config: 
opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: 
"77360e3dee9f4a748e75f830554326a8" member_type: VOTER last_known_addr { host: 
"master0" port: 7051 } } peers { permanent_uuid: 
"2a23cf2aee7549fbb63e6f8bcfb08cc3" member_type: VOTER last_known_addr { host: 
"master1" port: 7051 } } peers { permanent_uuid: 
"97326d428af84cf88d95eefe32eca0bd" member_type: VOTER last_known_addr { host: 
"master2" port: 7051 } }
W1206 01:32:15.488217 1324967 transaction_tracker.cc:122] transaction on tablet 
00000000000000000000000000000000 rejected due to memory pressure: the memory 
usage of this transaction (91215642) plus the current consumption (0) exceeds 
the transaction memory limit (67108864) or the limit of an ancestral memory 
tracker.
{noformat}

See the attached file for the stack trace captured in the core file.

  was:
This bug is about misreporting the root cause of the problem: it's not easy 
to correlate the error message with the actual problem or with the phase of 
the process lifecycle at which it occurred. After analysis, it turned out to 
be just another manifestation/consequence of 
[KUDU-3016|https://issues.apache.org/jira/browse/KUDU-3016].

I saw the master crash with the following error reported in the log:

{noformat}
F1206 01:32:15.488359 1324967 tablet_replica.cc:138] Check failed: state_ == 
SHUTDOWN || state_ == FAILED TabletReplica not fully shut down. State: 
BOOTSTRAPPING
{noformat}

It's not easy to tell at what point of the master's lifecycle it happened, but 
after looking through the log and into the generated core file, it became 
clear that the problem was just a consequence of the conditions that triggered 
KUDU-3016 in the first place:

Extra info from the log:
{noformat}
I1206 01:32:15.419330 1324967 tablet_bootstrap.cc:439] T 
00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8: Bootstrap 
complete.
I1206 01:32:15.471163 1324967 raft_consensus.cc:340] T 
00000000000000000000000000000000 P 77360e3dee9f4a748e75f830554326a8 [term 164 
FOLLOWER]: Replica starting. Triggering 11 pending transactions. Active config: 
opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: 
"77360e3dee9f4a748e75f830554326a8" member_type: VOTER last_known_addr { host: 
"nrappmst3" port: 7051 } } peers { permanent_uuid: 
"2a23cf2aee7549fbb63e6f8bcfb08cc3" member_type: VOTER last_known_addr { host: 
"nrappmst4" port: 7051 } } peers { permanent_uuid: 
"97326d428af84cf88d95eefe32eca0bd" member_type: VOTER last_known_addr { host: 
"nrappmst5" port: 7051 } }
W1206 01:32:15.488217 1324967 transaction_tracker.cc:122] transaction on tablet 
00000000000000000000000000000000 rejected due to memory pressure: the memory 
usage of this transaction (91215642) plus the current consumption (0) exceeds 
the transaction memory limit (67108864) or the limit of an ancestral memory 
tracker.
{noformat}

See the attached file for the stack trace captured in the core file.


> master crashes on attempt to replay orphaned ops in WAL, not reporting the 
> root cause of the problem
> ---------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-3017
>                 URL: https://issues.apache.org/jira/browse/KUDU-3017
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.10.0, 1.10.1, 1.11.0, 1.11.1
>            Reporter: Alexey Serbin
>            Priority: Minor
>         Attachments: core.stack.xz



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
