[ https://issues.apache.org/jira/browse/OAK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619405#comment-17619405 ]
Julian Sedding commented on OAK-9785:
-------------------------------------

I created a [PR to backport the change to the 1.22 branch|https://github.com/apache/jackrabbit-oak/pull/733].

> Tar SegmentStore can be corrupted during compaction
> ---------------------------------------------------
>
>                 Key: OAK-9785
>                 URL: https://issues.apache.org/jira/browse/OAK-9785
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: segment-tar
>    Affects Versions: 1.42.0
>            Reporter: Julian Sedding
>            Assignee: Julian Sedding
>            Priority: Major
>              Labels: candidate_oak_1_22, candidate_oak_1_8
>             Fix For: 1.46.0
>
>         Attachments: error.log.2022-06-09
>
>
> There is a scenario where a segment store can become corrupted, leading to
> {{SegmentNotFoundExceptions}} with very "young" {{SegmentIds}}, i.e. in the
> 1-2 digit millisecond range. E.g. {{SegmentId age=2ms}}.
> The scenario I observed looks as follows:
> - a blob is "lost" from the external blob store (presumably due to incorrect
> cloning of the instance, most likely only happens with unfortunate timing)
> - a tail revision GC run is performed (not sure if it matters that this was
> a tail compaction)
> -- the missing blob is encountered during compaction
> -- an exception other than an {{IOException}} (IIRC it was a
> {{IllegalArgumentException}}) is thrown due to the missing blob
> -- revision GC fails WITHOUT properly being aborted, and thus the partially
> written revision of the compaction run is not removed
> - more data is written on the instance
> - a full revision GC run is performed
> -- a referenced segment is removed due to the incorrect/confused revision
> data
> - the {{SegmentNotFoundException}} is first observed either during the
> remainder of the compaction run or when the respective node is requested the
> next time, usually during a traversal
> The root cause is in
> [{{AbstractCompactionStrategy}}|https://github.com/apache/jackrabbit-oak/blob/trunk/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/file/AbstractCompactionStrategy.java#L233],
> where only {{IOExceptions}} are caught.
> In order to improve the robustness of the code, I think we need to catch all
> {{Throwables}}. Otherwise we cannot guarantee that compaction is correctly
> aborted.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
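
The failure mode described in the issue (cleanup is skipped when something other than an {{IOException}} is thrown) can be illustrated with a minimal, self-contained sketch. This is not Oak's code: the class {{CompactionAbortSketch}}, the {{Compactor}} interface, the method names, and the list standing in for the segment store are all hypothetical and only model the control flow.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the problem; none of these names exist in Oak.
final class CompactionAbortSketch {

    /** Hypothetical compaction step that may fail with any exception type. */
    interface Compactor {
        void compact(List<String> store) throws IOException;
    }

    /** Narrow handling: only an IOException triggers the abort/cleanup path. */
    static boolean compactCatchingIOException(List<String> store, Compactor c) {
        int sizeBefore = store.size();
        try {
            c.compact(store);
            return true;
        } catch (IOException e) {
            abort(store, sizeBefore);
            return false;
        }
    }

    /** Broad handling: any Throwable triggers the abort/cleanup path. */
    static boolean compactCatchingThrowable(List<String> store, Compactor c) {
        int sizeBefore = store.size();
        try {
            c.compact(store);
            return true;
        } catch (Throwable t) {
            abort(store, sizeBefore);
            return false;
        }
    }

    /** Removes everything written since compaction started. */
    private static void abort(List<String> store, int sizeBefore) {
        store.subList(sizeBefore, store.size()).clear();
    }

    public static void main(String[] args) {
        // Simulates hitting a missing blob after some data was already written.
        Compactor missingBlob = store -> {
            store.add("partially-compacted-segment");
            throw new IllegalArgumentException("missing blob");
        };

        List<String> narrow = new ArrayList<>(List.of("seg-1"));
        try {
            compactCatchingIOException(narrow, missingBlob);
        } catch (IllegalArgumentException e) {
            // The exception escapes and the partial write stays in the store.
        }
        System.out.println("catch IOException: " + narrow);

        List<String> broad = new ArrayList<>(List.of("seg-1"));
        boolean ok = compactCatchingThrowable(broad, missingBlob);
        System.out.println("catch Throwable:   " + broad + ", success=" + ok);
    }
}
```

In the narrow variant the unchecked exception bypasses the cleanup, so the partially written entry survives; in the broad variant it is removed. Catching {{Throwable}} also intercepts {{Error}} subclasses, so a real implementation may want to log or rethrow after cleaning up; that choice is outside what the issue specifies.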