As for me, the correct approach is to trigger a checkpoint when we get too close to the WAL archive size limit. The main purpose of this mechanism is to provide durability, so we should aim neither to fail the node nor to delete data voluntarily, but to prevent possible data loss.
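A minimal sketch of that idea (all names below are hypothetical, not actual Ignite internals): check the archive size on each segment rollover and trigger a checkpoint before the hard limit is hit, so that older segments become eligible for cleanup in time.

    // Hypothetical watcher; in the real code the checkpoint trigger would be
    // something like GridCacheDatabaseSharedManager#forceCheckpoint(String).
    class WalArchiveWatcher {
        /** Assumed soft-limit ratio; 0.8 is an arbitrary example value. */
        private static final double SOFT_LIMIT_RATIO = 0.8;

        private final long maxWalArchiveSize;
        private final Runnable checkpointTrigger;

        WalArchiveWatcher(long maxWalArchiveSize, Runnable checkpointTrigger) {
            this.maxWalArchiveSize = maxWalArchiveSize;
            this.checkpointTrigger = checkpointTrigger;
        }

        /** Called after each segment is moved to the archive. */
        void onSegmentArchived(long archiveSizeBytes) {
            if (archiveSizeBytes > SOFT_LIMIT_RATIO * maxWalArchiveSize) {
                // A completed checkpoint lets the cleaner remove old
                // segments before the hard limit is reached.
                checkpointTrigger.run();
            }
        }
    }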
Tue, 26 Jan 2021 at 19:13, Zhenya Stanilovsky <arzamas...@mail.ru.invalid>:

> Hello!
> This is unclear for me: nothing described above explains why the node
> works improperly and why the failure handler (FH) could possibly fail
> this node. Can you explain?
>
> > Hello, everyone!
> >
> > Currently, the property DataStorageConfiguration#maxWalArchiveSize does
> > not work as users expect. We can easily go beyond this limit and
> > overflow the disk, which leads to errors and a crash of the node. I
> > propose to fix this behavior and not let the WAL archive overflow.
> >
> > It is suggested not to add segments to the archive if doing so would
> > exceed DataStorageConfiguration#maxWalArchiveSize, and to wait until
> > space becomes available.
> >
> > Thus, we may get a deadlock:
> > take checkpointReadLock -> write to WAL -> need to roll over the WAL
> > segment -> need to clean the WAL archive -> need to complete a
> > checkpoint (impossible because the checkpointReadLock is taken).
> >
> > To avoid such situations, I suggest adding a custom heuristic: do not
> > grant IgniteCacheDatabaseSharedManager#checkpointReadLock if there are
> > few (by default, 1) segments left.
> > But this will not let us completely avoid archive overflow situations.
> > Therefore, I suggest failing the node via the FH when such a deadlock
> > is detected, since the outcome would be the same as if there were no
> > disk space left.
> >
> > --
> > Sincerely yours, Ivan Daschinskiy
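For reference, the property under discussion is set through the public configuration API; a minimal sketch, assuming Ignite 2.x (segment and archive sizes are just example values):

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class WalArchiveConfigExample {
        public static void main(String[] args) {
            DataStorageConfiguration dsCfg = new DataStorageConfiguration()
                .setWalSegmentSize(64 * 1024 * 1024)             // 64 MB per segment (example value)
                .setMaxWalArchiveSize(10L * 1024 * 1024 * 1024); // 10 GB archive cap (example value)

            // Persistence must be enabled for the WAL archive to matter.
            dsCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(dsCfg);

            Ignition.start(cfg);
        }
    }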
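And to make the proposed heuristic concrete, here is a rough sketch of the guard being discussed; the names (WalState, FailureNotifier, segmentsLeftBeforeArchiveLimit, cleanerIsStuck) are mine, not the actual IgniteCacheDatabaseSharedManager code:

    // Sketch: refuse to grant the checkpoint read lock when too few WAL
    // segments remain before the archive limit, and escalate to the failure
    // handler if the cleaner is blocked anyway, just as on an out-of-disk
    // condition.
    class CheckpointReadLockGuard {
        /** Proposed default: keep at least 1 spare segment. */
        private final int minFreeSegments = 1;

        private final WalState wal;                    // hypothetical view of WAL/archive state
        private final FailureNotifier failureNotifier; // stands in for Ignite's failure handling

        CheckpointReadLockGuard(WalState wal, FailureNotifier failureNotifier) {
            this.wal = wal;
            this.failureNotifier = failureNotifier;
        }

        void checkpointReadLock() throws InterruptedException {
            // Hold back new WAL-writing operations while the archive is nearly
            // full: the cleaner needs a completed checkpoint to free segments,
            // and a completed checkpoint needs all read locks released.
            while (wal.segmentsLeftBeforeArchiveLimit() < minFreeSegments) {
                if (wal.cleanerIsStuck()) {
                    // Deadlock detected: fail the node via the failure handler.
                    failureNotifier.fail("WAL archive cleanup deadlock");
                    return;
                }
                Thread.sleep(10); // simplistic back-off for the sketch
            }
            // ... acquire the actual read lock here ...
        }

        interface WalState {
            int segmentsLeftBeforeArchiveLimit();
            boolean cleanerIsStuck();
        }

        interface FailureNotifier {
            void fail(String reason);
        }
    }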