Hello! Maybe we could use a mechanism here similar (or identical) to checkpoint-based write throttling?
That way we would throttle on both the checkpoint page buffer and the WAL limit.

Regards,
-- 
Ilya Kasnacheev

Tue, May 4, 2021 at 11:29, ткаленко кирилл <tkalkir...@yandex.ru>:

> Hello everybody!
>
> At the moment, if there are partitions for the rebalance for which the
> historical rebalance will be used, then we reserve segments in the WAL
> archive (we do not allow cleaning the WAL archive) until the rebalance for
> all cache groups is over.
>
> If a cluster is under load during the rebalance, the WAL archive size may
> significantly exceed the limit set in
> DataStorageConfiguration#getMaxWalArchiveSize until the process is
> complete. This may lead to user issues, and nodes may crash with the "No
> space left on device" error.
>
> We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE, by
> default 0.5, which sets the threshold (multiplied by getMaxWalArchiveSize)
> down to which the WAL archive will be cleared, i.e. it sets the
> size of the WAL archive that will always be kept on the node. I propose to
> replace this system property with
> DataStorageConfiguration#getWalArchiveSize in bytes; the default is
> (getMaxWalArchiveSize * 0.5), as it is now.
>
> Main proposal:
> When DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel
> the existing reservations of WAL segments and do not grant new ones until
> we reach DataStorageConfiguration#getWalArchiveSize. In this case, if there
> is no segment for historical rebalance, we will automatically switch to
> full rebalance.
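
To make sure we read the main proposal the same way, here is a minimal sketch of the two-threshold behaviour it describes. This is not Ignite code: the class WalArchiveReservationSketch and all of its methods are hypothetical names made up for illustration, maxArchiveSize stands in for getMaxWalArchiveSize, and minArchiveSize stands in for the proposed getWalArchiveSize.

```java
/**
 * Hypothetical sketch of the proposed reservation policy; not part of Ignite.
 * Reservations are cancelled once the archive reaches the upper limit and are
 * only granted again after cleanup brings the archive down to the lower limit.
 */
final class WalArchiveReservationSketch {
    /** Upper limit in bytes (stands in for getMaxWalArchiveSize). */
    private final long maxArchiveSize;

    /** Lower limit in bytes (stands in for the proposed getWalArchiveSize). */
    private final long minArchiveSize;

    /** Whether WAL segment reservations are currently granted. */
    private boolean reservationsAllowed = true;

    WalArchiveReservationSketch(long maxArchiveSize, long minArchiveSize) {
        if (minArchiveSize > maxArchiveSize)
            throw new IllegalArgumentException("minArchiveSize must not exceed maxArchiveSize");

        this.maxArchiveSize = maxArchiveSize;
        this.minArchiveSize = minArchiveSize;
    }

    /**
     * Reports the current archive size (assumed to be called after each
     * segment is archived or cleaned).
     *
     * @return true if existing reservations must be cancelled right now.
     */
    boolean onArchiveSizeChanged(long archiveSize) {
        if (reservationsAllowed && archiveSize >= maxArchiveSize) {
            // Upper limit reached: release reservations so cleanup can proceed.
            reservationsAllowed = false;
            return true;
        }

        if (!reservationsAllowed && archiveSize <= minArchiveSize) {
            // Archive has shrunk to the lower limit: reservations may resume.
            reservationsAllowed = true;
        }

        return false;
    }

    /** @return true if a new reservation would be granted at the moment. */
    boolean tryReserve() {
        return reservationsAllowed;
    }

    /** Falls back to full rebalance when no segment can be reserved. */
    String rebalanceMode(boolean segmentAvailable) {
        return tryReserve() && segmentAvailable ? "HISTORICAL" : "FULL";
    }
}
```

With limits of 1000 and 500 bytes, for example, reservations would be cancelled when the archive hits 1000, stay refused at 700, and be granted again once the archive is cleaned down to 500. The gap between the two limits is what keeps the node from flipping between historical and full rebalance on every segment.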