Hi Zakelly!

I was able to rewrite the SST file with the missing block and patch my
checkpoint/savepoint but, as you would probably expect, another check then
failed:


*Caused by: org.rocksdb.RocksDBException: file is too short (268848910
bytes) to be an sstable/....../*
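
For anyone hitting the same error, here is a rough sketch of a cheap
pre-check that can be run on a rewritten SST file before patching it back
into a checkpoint. It only looks at the file length and the 8-byte magic
number at the very end of the footer. The two magic constants are the
block-based table values as I understand them from the RocksDB source, so
please double-check them against your RocksDB version. `check_sst_tail` and
`expected_size` are names made up for this sketch, and I am not claiming this
is the same check RocksDB performs when it raises "file is too short".

```python
import os
import struct

# Magic numbers stored little-endian in the last 8 bytes of an SST footer.
# Assumption: values taken from the RocksDB source; verify for your version.
BLOCK_BASED_MAGIC = 0x88E241B785F4CFF7
LEGACY_BLOCK_BASED_MAGIC = 0xDB4775248B80FB57


def check_sst_tail(path, expected_size=None):
    """Return a list of problems found; an empty list means these cheap checks passed."""
    problems = []
    actual = os.path.getsize(path)

    # Hypothetical parameter: whatever size you believe the file should have,
    # for example the size shown by your DFS listing.
    if expected_size is not None and actual != expected_size:
        problems.append(f"size mismatch: expected {expected_size}, found {actual}")

    if actual < 8:
        problems.append("file has fewer than 8 bytes, no room for a magic number")
        return problems

    # Read the trailing 8 bytes and decode them as an unsigned 64-bit integer.
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        (magic,) = struct.unpack("<Q", f.read(8))

    if magic not in (BLOCK_BASED_MAGIC, LEGACY_BLOCK_BASED_MAGIC):
        problems.append(f"unexpected trailing magic 0x{magic:016x}")
    return problems
```

Passing this check does not mean the file is valid; it only rules out a
truncated or misaligned tail. A full `rocksdb_sst_dump --command=verify` run
is still needed afterwards.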

The problem is that the bad SST file is in both our checkpoints and our
savepoints. We are likely faced with a rather complicated re-bootstrap of
this operator, which I assume will drop this SST file entirely once we drop
the bad state.
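
In case it is useful to others, the block checksum mismatch from my original
report can be inspected directly with a small script. This is a sketch under
several assumptions: that the block trailer is 1 compression-type byte
followed by a 4-byte little-endian checksum, that checksum type 1 means
CRC-32C over the block bytes plus the type byte, and that the stored value
uses the LevelDB-style mask. Newer table format versions may mix extra
context into the checksum, and depending on the RocksDB version the numbers
printed by sst_dump may be the masked or the unmasked values, so treat the
result as a hint only. `block_checksums` is a name made up for this sketch.

```python
import struct

MASK_DELTA = 0xA282EAD8  # constant used by the LevelDB-style CRC mask


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli); slow but dependency-free."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def mask(crc: int) -> int:
    """Rotate right by 15 bits and add a constant, truncated to 32 bits."""
    return (((crc >> 15) | (crc << 17)) + MASK_DELTA) & 0xFFFFFFFF


def block_checksums(path, offset, size):
    """Return (stored, computed) masked checksums for the block at offset/size."""
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(size + 5)  # block bytes + 1 type byte + 4 checksum bytes
    if len(raw) < size + 5:
        raise ValueError("file ends before the block trailer")
    (stored,) = struct.unpack("<I", raw[size + 1:size + 5])
    computed = mask(crc32c(raw[:size + 1]))
    return stored, computed
```

For the block in my report this would be called as
`block_checksums(path, 84885876, 11204)`, using the offset and size from the
sst_dump output.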

Darin


On Wed, Feb 4, 2026 at 10:57 AM Zakelly Lan <[email protected]> wrote:

> Hi Darin,
>
> I'm afraid the corruption is very difficult to fix. The only options are
> to rewrite the whole RocksDB MANIFEST to remove that file, or to rewrite
> the SST file; either way there will be some data loss. Alternatively, if
> you have enabled local recovery, you may find a local copy of that
> checkpoint file that can be used to replace the corresponding file on DFS.
> Or, if the corrupted file itself came from the local copy, disabling local
> recovery may help.
>
> This kind of corruption is rare, and I guess it was caused by a DFS
> failure or disk corruption. You may want to keep an eye on that.
>
>
> Best,
> Zakelly
>
> On Wed, Feb 4, 2026 at 12:03 PM Darin Amos via user <[email protected]>
> wrote:
>
>> Hi!
>>
>> I have a problem where my incremental checkpoint has a corrupt SST file
>> that was created weeks ago, meaning going back in time to replay the data
>> to fix the corruption is not possible, and re-bootstrapping the job is
>> extremely difficult.
>>
>> Is there a way to patch the corrupt SST file to fix my job? In this
>> particular case some data loss is acceptable in favour of system health.
>>
>> Thanks!
>>
>> Darin
>>
>>
>> % $(brew --prefix rocksdb)/bin/rocksdb_sst_dump \
>>   --file=./checkpoint_verification/sst_files/06240ecd-9154-409b-8a32-3a0ebd8e64de.sst \
>>   --command=verify --verify_checksum
>>
>> options.env is 0x600003f638e0
>> Process ./checkpoint_verification/sst_files/06240ecd-9154-409b-8a32-3a0ebd8e64de.sst
>> Sst file format: block-based
>> ./checkpoint_verification/sst_files/06240ecd-9154-409b-8a32-3a0ebd8e64de.sst is corrupted: Corruption: block checksum mismatch: stored = 3954219857, computed = 4054404265, type = 1  in ./checkpoint_verification/sst_files/06240ecd-9154-409b-8a32-3a0ebd8e64de.sst offset 84885876 size 11204
>>
>>
>>
