[
https://issues.apache.org/jira/browse/IGNITE-27345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18047993#comment-18047993
]
Roman Puchkovskiy commented on IGNITE-27345:
--------------------------------------------
The patch looks good to me.
> NullPointerException in WriteIntentSwitchRequestHandler
> -------------------------------------------------------
>
> Key: IGNITE-27345
> URL: https://issues.apache.org/jira/browse/IGNITE-27345
> Project: Ignite
> Issue Type: Bug
> Reporter: Roman Puchkovskiy
> Assignee: Filipp Shergalis
> Priority: Major
> Labels: ignite-3
> Fix For: 3.2
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
>
> {noformat}
> Caused by: java.lang.NullPointerException
> at org.apache.ignite.internal.partition.replicator.handlers.WriteIntentSwitchRequestHandler.lambda$invokeTableWriteIntentSwitchReplicaRequest$4(WriteIntentSwitchRequestHandler.java:172)
> ... 24 more{noformat}
>
>
> The node was dying due to heavy heap pressure. At some point, many stack
> traces of this kind were written to the log.
> The exception means that either the corresponding processor was not added
> yet, or that it had already been removed.
> # If it was not added yet, this is a bug and we probably lack some
> happens-before between adding a table processor and processing messages for
> this table. But this seems unlikely as for
> TableWriteIntentSwitchReplicaRequest, the same mechanism as for other
> TableAware messages in PartitionReplicaListener is used to make sure that
> table resources are ready to process the corresponding TableAware request.
> The mechanism is taking current time from the clock (updated with the
> requester's clock time passed via request.timestamp) and then doing a schema
> sync with that time (as the table was already created earlier, it makes sure
> that table resources are prepared and installed). Nevertheless,
> WriteIntentSwitchRequestHandler does this trick itself and its code might
> differ from PartitionReplicaListener's, so it makes sense to make sure we
> don't have a bug here.
> # If the table was removed, then the corresponding table processor was
> removed. In PRL, we explicitly check for null. Probably, we have to do the
> same in WriteIntentSwitchRequestHandler as well, and this seems to be the
> actual reason (and the candidate fix). Also, please take a look at
> IGNITE-26819 and the corresponding
> [PR|https://github.com/apache/ignite-3/pull/6944]; it seems that the NPE we
> see here is a consequence of the fix for IGNITE-26819 being incomplete. One
> concern with this explanation: the user says that they did not drop any
> tables (but they could be wrong).
> The second item seems to be the culprit, but it would be great to write the
> corresponding test this time.
> The scenario for it is:
> # Some external transaction creates a write intent
> # It's committed, but WI cleanup is not performed
> # The table is dropped
> # The LWM rises enough to cause the table destruction
> # Only now do we make another attempt to switch the WI
>
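The null check proposed in the second item of the quoted description, together with the test scenario listed there, could be sketched roughly as follows. This is only an illustration: the class `WriteIntentSwitchSketch`, the `TableProcessor` interface, and all method names are hypothetical stand-ins, not the actual Ignite 3 code. Treating a missing processor as "nothing to clean up" is also an assumption about the desired behavior.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of a handler that looks up per-table processors; not the real Ignite API. */
public class WriteIntentSwitchSketch {
    /** Hypothetical per-table processor that performs the write intent switch. */
    public interface TableProcessor {
        CompletableFuture<String> switchWriteIntents(String txId);
    }

    private final Map<Integer, TableProcessor> processors = new ConcurrentHashMap<>();

    public void addTableProcessor(int tableId, TableProcessor processor) {
        processors.put(tableId, processor);
    }

    /** Models the table destruction that happens once the LWM has risen far enough. */
    public void removeTableProcessor(int tableId) {
        processors.remove(tableId);
    }

    public CompletableFuture<String> handle(int tableId, String txId) {
        TableProcessor processor = processors.get(tableId);

        if (processor == null) {
            // Assumption: the table was already destroyed, so there are no write intents
            // left to switch. Complete normally instead of dereferencing null.
            return CompletableFuture.completedFuture("skipped: no processor for table " + tableId);
        }

        return processor.switchWriteIntents(txId);
    }

    public static void main(String[] args) {
        WriteIntentSwitchSketch handler = new WriteIntentSwitchSketch();

        // Steps 1-2: a transaction creates a write intent and commits; cleanup is pending.
        handler.addTableProcessor(1, txId -> CompletableFuture.completedFuture("switched: " + txId));

        // Steps 3-4: the table is dropped and later destroyed.
        handler.removeTableProcessor(1);

        // Step 5: a late attempt to switch the write intent must not throw an NPE.
        System.out.println(handler.handle(1, "tx-1").join());
    }
}
```

A real test would have to drive the actual drop and low watermark machinery; the sketch only shows the shape of the guard and the order of events.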