Hi Ilya! Then we would greatly reduce the user load on the cluster until the rebalance is over, which can be critical for the user.
04.05.2021, 18:43, "Ilya Kasnacheev" <ilya.kasnach...@gmail.com>:
> Hello!
>
> Maybe we can have a mechanic here similar (or equal) to checkpoint-based
> write throttling?
>
> So we would throttle for both the checkpoint page buffer and the WAL limit.
>
> Regards,
> --
> Ilya Kasnacheev
>
> Tue, 4 May 2021 at 11:29, ткаленко кирилл <tkalkir...@yandex.ru>:
>
>> Hello everybody!
>>
>> At the moment, if there are partitions to rebalance for which historical
>> rebalance will be used, we reserve segments in the WAL archive (i.e. we do
>> not allow the WAL archive to be cleaned) until the rebalance of all cache
>> groups is over.
>>
>> If the cluster is under load during the rebalance, the WAL archive size
>> may significantly exceed the limit set in
>> DataStorageConfiguration#getMaxWalArchiveSize until the process completes.
>> This may cause problems for users, and nodes may crash with a "No space
>> left on device" error.
>>
>> We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>> (default 0.5) that sets the threshold (multiplied by getMaxWalArchiveSize)
>> down to which the WAL archive is cleared, i.e. the size of the WAL archive
>> that will always remain on the node. I propose replacing this system
>> property with DataStorageConfiguration#getWalArchiveSize, in bytes, with a
>> default of (getMaxWalArchiveSize * 0.5) as it is now.
>>
>> Main proposal:
>> When DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel
>> existing reservations and do not allow new reservations of WAL segments
>> until the archive shrinks back to DataStorageConfiguration#getWalArchiveSize.
>> In this case, if a segment needed for historical rebalance is no longer
>> available, we automatically switch to full rebalance.
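
To make the proposal concrete, here is a minimal configuration sketch. Note the assumptions: setMaxWalArchiveSize and setWriteThrottlingEnabled are existing DataStorageConfiguration settings, while the setWalArchiveSize setter shown in the comment is only the accessor proposed in this thread (replacing IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE) and does not exist yet; the byte values are arbitrary examples.

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalArchiveConfigSketch {
    public static void main(String[] args) {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
            // Persistence must be enabled for the WAL archive to be used at all.
            .setDefaultDataRegionConfiguration(
                new DataRegionConfiguration().setPersistenceEnabled(true))
            // Existing setting: hard upper bound for the WAL archive, in bytes.
            .setMaxWalArchiveSize(4L * 1024 * 1024 * 1024)
            // Existing setting: checkpoint-based write throttling, which the
            // thread suggests as a model for throttling on the WAL limit.
            .setWriteThrottlingEnabled(true);

        // Proposed (hypothetical) setter from this thread, not an existing API:
        // the size the archive is cleared down to once the maximum is reached,
        // defaulting to getMaxWalArchiveSize * 0.5 as the system property does now.
        // dsCfg.setWalArchiveSize(2L * 1024 * 1024 * 1024);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(dsCfg);

        Ignition.start(cfg);
    }
}

Under the proposed behavior, once the archive hits the 4 GB maximum above, WAL segment reservations would be cancelled and new ones refused until the archive drops back to the configured target, and any node that loses a segment needed for historical rebalance would fall back to full rebalance.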