[ 
https://issues.apache.org/jira/browse/OAK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619405#comment-17619405
 ] 

Julian Sedding commented on OAK-9785:
-------------------------------------

I created a [PR to backport the change to the 1.22 
branch|https://github.com/apache/jackrabbit-oak/pull/733].

> Tar SegmentStore can be corrupted during compaction
> ---------------------------------------------------
>
>                 Key: OAK-9785
>                 URL: https://issues.apache.org/jira/browse/OAK-9785
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: segment-tar
>    Affects Versions: 1.42.0
>            Reporter: Julian Sedding
>            Assignee: Julian Sedding
>            Priority: Major
>              Labels: candidate_oak_1_22, candidate_oak_1_8
>             Fix For: 1.46.0
>
>         Attachments: error.log.2022-06-09
>
>
> There is a scenario where a segment store can become corrupted, leading to 
> {{SegmentNotFoundException}}s with very "young" {{SegmentId}}s, i.e. with an 
> age in the 1-2 digit millisecond range, e.g. {{SegmentId age=2ms}}.
> The scenario I observed looks as follows:
>  - a blob is "lost" from the external blob store (presumably due to incorrect 
> cloning of the instance, most likely only happens with unfortunate timing)
>  - a tail revision GC run is performed (not sure if it matters that this was 
> a tail compaction)
>  -- the missing blob is encountered during compaction
>  -- an exception other than an {{IOException}} (IIRC it was an 
> {{IllegalArgumentException}}) is thrown due to the missing blob
>  -- revision GC fails WITHOUT properly being aborted, and thus the partially 
> written revision of the compaction run is not removed
>  - more data is written on the instance
>  - a full revision GC run is performed
>  -- a referenced segment is removed due to the incorrect/confused revision 
> data
>  - the {{SegmentNotFoundException}} is first observed either during the 
> remainder of the compaction run or when the respective node is requested the 
> next time, usually during a traversal
> The root cause is in 
> [{{AbstractCompactionStrategy}}|https://github.com/apache/jackrabbit-oak/blob/trunk/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/file/AbstractCompactionStrategy.java#L233],
>  where only {{IOException}}s are caught.
> To make the code more robust, I think we need to catch all 
> {{Throwable}}s. Otherwise we cannot guarantee that compaction is correctly 
> aborted.
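
The difference between the two catch clauses can be illustrated with a minimal, self-contained sketch. It is not the actual Oak code: the class CompactionAbortSketch, the CompactionTask interface, the methods compactNarrow, compactRobust and abort, and the list standing in for the segment store are all hypothetical names chosen for illustration. The sketch only assumes what the description states, namely that an unchecked exception escapes a handler that catches IOException alone, so the cleanup of the partially written revision is skipped.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CompactionAbortSketch {

    /** Hypothetical stand-in for one compaction run writing into a store. */
    interface CompactionTask {
        void run(List<String> store) throws IOException;
    }

    /** Hypothetical cleanup: drop everything written after the given mark. */
    static void abort(List<String> store, int mark) {
        store.subList(mark, store.size()).clear();
    }

    /** Narrow handling: cleanup happens only for IOException. */
    static boolean compactNarrow(CompactionTask task, List<String> store) {
        int mark = store.size();
        try {
            task.run(store);
            return true;
        } catch (IOException e) {
            abort(store, mark);
            return false;
        }
    }

    /** Broad handling: any Throwable leads to a proper abort. */
    static boolean compactRobust(CompactionTask task, List<String> store) {
        int mark = store.size();
        try {
            task.run(store);
            return true;
        } catch (Throwable t) {
            abort(store, mark);
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulates the missing blob: a partial write followed by an
        // unchecked exception.
        CompactionTask failing = store -> {
            store.add("partial-segment");
            throw new IllegalArgumentException("missing blob");
        };

        List<String> narrowStore = new ArrayList<>(List.of("base-segment"));
        try {
            compactNarrow(failing, narrowStore);
        } catch (IllegalArgumentException e) {
            // The exception bypassed the catch block, so abort() never ran.
            System.out.println("narrow: store after failure = " + narrowStore);
        }

        List<String> robustStore = new ArrayList<>(List.of("base-segment"));
        boolean ok = compactRobust(failing, robustStore);
        System.out.println("robust: success = " + ok + ", store = " + robustStore);
    }
}
```

In this sketch the narrow variant leaves the partial segment behind, while the broad variant removes it. Whether an Error should be rethrown after cleanup is a design choice that the sketch leaves open.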



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
