[ 
https://issues.apache.org/jira/browse/IGNITE-27345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18047993#comment-18047993
 ] 

Roman Puchkovskiy commented on IGNITE-27345:
--------------------------------------------

The patch looks good to me

> NullPointerException in WriteIntentSwitchRequestHandler
> -------------------------------------------------------
>
>                 Key: IGNITE-27345
>                 URL: https://issues.apache.org/jira/browse/IGNITE-27345
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Filipp Shergalis
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.2
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
>  
> {noformat}
> Caused by: java.lang.NullPointerException
> at org.apache.ignite.internal.partition.replicator.handlers.WriteIntentSwitchRequestHandler.lambda$invokeTableWriteIntentSwitchReplicaRequest$4(WriteIntentSwitchRequestHandler.java:172)
> ... 24 more{noformat}
>  
>  
> The node was dying under heavy heap pressure. At some point, many stack 
> traces of this kind were written to the log.
> The exception means that either the corresponding processor was not added 
> yet, or that it had already been removed.
>  # If it was not added yet, this is a bug and we probably lack some 
> happens-before between adding a table processor and processing messages for 
> this table. But this seems unlikely as for 
> TableWriteIntentSwitchReplicaRequest, the same mechanism as for other 
> TableAware messages in PartitionReplicaListener is used to make sure that 
> table resources are ready to process the corresponding TableAware request. 
> The mechanism takes the current time from the clock (updated with the 
> requester's clock time passed via request.timestamp) and then does a schema 
> sync with that time (as the table was already created earlier, this makes sure 
> that table resources are prepared and installed). Nevertheless, 
> WriteIntentSwitchRequestHandler implements this mechanism itself, and its code 
> might differ from PartitionReplicaListener's, so it makes sense to make sure we 
> don't have a bug here.
>  # If the table was removed, then the corresponding table processor was 
> removed. In PartitionReplicaListener (PRL), we explicitly check for null. We 
> probably have to do the same in WriteIntentSwitchRequestHandler as well, and 
> this seems to be the actual reason (and the candidate fix). Also, please take a look at 
> IGNITE-26819 and the corresponding 
> [PR|https://github.com/apache/ignite-3/pull/6944]; it seems that the NPE we 
> see here is a consequence of the fix for IGNITE-26819 being incomplete. One 
> thing about this explanation is worrying: the user says that they did 
> not drop any tables (but they could be wrong).
> The second item seems to be the culprit, but it would be great to write the 
> corresponding test this time.
> The scenario for it is:
>  # Some external transaction creates a write intent
>  # It's committed, but WI cleanup is not performed
>  # The table is dropped
>  # The low watermark (LWM) rises enough to cause the table destruction
>  # Only now do we make another attempt to switch the WI
>  
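
For illustration, the null check proposed in the second item, together with the test scenario above, can be sketched as follows. This is a minimal model under stated assumptions, not the code of the actual patch: WriteIntentSwitchSketch, TableProcessor, addProcessor, removeProcessor, handle and the "SKIPPED" result are hypothetical names, and the real WriteIntentSwitchRequestHandler may choose a different response for a destroyed table.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the handler; none of these names are Ignite's real API. */
class WriteIntentSwitchSketch {
    /** Hypothetical per-table processor that performs the write intent switch. */
    interface TableProcessor {
        CompletableFuture<String> switchWriteIntents(String txId);
    }

    /** Table ID -> processor. An entry disappears when the table is destroyed. */
    private final Map<Integer, TableProcessor> processors = new ConcurrentHashMap<>();

    void addProcessor(int tableId, TableProcessor processor) {
        processors.put(tableId, processor);
    }

    void removeProcessor(int tableId) {
        processors.remove(tableId);
    }

    /** Handles a write intent switch request for one table. */
    CompletableFuture<String> handle(int tableId, String txId) {
        TableProcessor processor = processors.get(tableId);

        if (processor == null) {
            // Assumption: the table was already destroyed, so there are no
            // write intents left to switch. Complete normally instead of
            // dereferencing null (which is what produces the NPE).
            return CompletableFuture.completedFuture("SKIPPED");
        }

        return processor.switchWriteIntents(txId);
    }
}
```

In this model, a request that arrives while the processor is registered is delegated to it, and a request that arrives after removeProcessor completes without an exception. This mirrors the last step of the scenario, where the switch is retried only after the table has been destroyed.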



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
