An interesting suggestion I heard today. The minWalArchiveSize property might actually be minWalArchiveTimespan - i.e. be a number of seconds instead of a number of bytes!
I think this makes perfect sense from the user's point of view: "I want to keep the WAL archive for at least N hours, but I have a limit of M gigabytes to store it."

Do we have the checkpoint timestamp stored anywhere (in the checkpoint start markers, perhaps)? If so, perhaps we can actually implement this?

Thanks,
Stan

> On 6 May 2021, at 14:13, Stanislav Lukyanov <stanlukya...@gmail.com> wrote:
>
> +1 to cancel WAL reservation on reaching getMaxWalArchiveSize
> +1 to add a public property to replace
> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>
> I don't like the name getWalArchiveSize - I think it's a bit confusing (is it
> the current size? the minimal size? the target size?)
> I suggest naming the property getMinWalArchiveSize. I think that this is
> exactly what it is - the minimal size of the archive that we want to have.
> The archive size at all times should be between min and max.
> If the archive size is less than min or more than max, then system
> functionality can degrade (e.g. historical rebalance may not work as
> expected).
> I think these rules are intuitively understood from the "min" and "max" names.
>
> Ilya's suggestion about throttling is great, although I'd do this in a
> different ticket.
>
> Thanks,
> Stan
>
>> On 5 May 2021, at 19:25, Maxim Muzafarov <mmu...@apache.org> wrote:
>>
>> Hello, Kirill
>>
>> +1 for this change; however, there are already too many configuration
>> settings for the user to configure an Ignite cluster. It is better to
>> keep the options that we already have and fix the behaviour of the
>> rebalance process as you suggested.
>>
>> On Tue, 4 May 2021 at 19:01, ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>>>
>>> Hi Ilya!
>>>
>>> Then we may greatly reduce the user load on the cluster until the rebalance
>>> is over, which can be critical for the user.
>>>
>>> 04.05.2021, 18:43, "Ilya Kasnacheev" <ilya.kasnach...@gmail.com>:
>>>> Hello!
>>>>
>>>> Maybe we can have a mechanism here similar (or equal) to checkpoint-based
>>>> write throttling?
>>>>
>>>> So we will be throttling for both the checkpoint page buffer and the WAL limit.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>> Tue, 4 May 2021 at 11:29, ткаленко кирилл <tkalkir...@yandex.ru>:
>>>>
>>>>> Hello everybody!
>>>>>
>>>>> At the moment, if there are partitions for which historical rebalance
>>>>> will be used, we reserve segments in the WAL archive (we do not allow
>>>>> cleaning the WAL archive) until the rebalance for all cache groups is
>>>>> over.
>>>>>
>>>>> If a cluster is under load during the rebalance, the WAL archive size may
>>>>> significantly exceed the limit set in
>>>>> DataStorageConfiguration#getMaxWalArchiveSize until the process is
>>>>> complete. This may lead to user issues, and nodes may crash with the "No
>>>>> space left on device" error.
>>>>>
>>>>> We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>>>>> (0.5 by default), which sets the threshold (multiplied by
>>>>> getMaxWalArchiveSize) down to which the WAL archive will be cleared,
>>>>> i.e. it sets the size of the WAL archive that will always be kept on the
>>>>> node. I propose to replace this system property with
>>>>> DataStorageConfiguration#getWalArchiveSize in bytes; the default is
>>>>> (getMaxWalArchiveSize * 0.5), as it is now.
>>>>>
>>>>> Main proposal:
>>>>> When DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel
>>>>> the existing reservations of WAL segments and do not grant new ones until
>>>>> the archive shrinks to DataStorageConfiguration#getWalArchiveSize. In
>>>>> this case, if there is no segment for historical rebalance, we will
>>>>> automatically switch to full rebalance.
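
For illustration, the main proposal at the bottom of the thread could be sketched roughly as follows. This is only a sketch under my own assumptions, not Ignite code: the class WalArchiveReservationSketch, its methods and the RebalanceMode enum are hypothetical names, and the two limits merely stand in for getMaxWalArchiveSize and the proposed minimum (getWalArchiveSize / getMinWalArchiveSize).

```java
/**
 * Hypothetical sketch of the proposed reservation rule; none of these names
 * exist in Apache Ignite.
 */
final class WalArchiveReservationSketch {
    /** Rebalance kinds mentioned in the thread. */
    enum RebalanceMode { HISTORICAL, FULL }

    /** Stand-in for the proposed minimum archive size, in bytes. */
    private final long minArchiveSize;

    /** Stand-in for getMaxWalArchiveSize, in bytes. */
    private final long maxArchiveSize;

    /** True while reservations are cancelled and new ones are refused. */
    private boolean reservationsBlocked;

    WalArchiveReservationSketch(long minArchiveSize, long maxArchiveSize) {
        if (minArchiveSize < 0 || minArchiveSize > maxArchiveSize)
            throw new IllegalArgumentException("Expected 0 <= min <= max");

        this.minArchiveSize = minArchiveSize;
        this.maxArchiveSize = maxArchiveSize;
    }

    /**
     * Feeds the current archive size into the policy.
     * Reaching max blocks reservations; they stay blocked until the archive
     * has been cleaned down to min.
     *
     * @return true if reservations are currently blocked.
     */
    boolean onArchiveSize(long currentSize) {
        if (currentSize >= maxArchiveSize)
            reservationsBlocked = true;
        else if (currentSize <= minArchiveSize)
            reservationsBlocked = false;

        return reservationsBlocked;
    }

    /**
     * Historical rebalance needs a reserved segment; if reservations are
     * blocked or the segment is gone, fall back to full rebalance.
     */
    RebalanceMode chooseMode(boolean historySegmentAvailable) {
        return !reservationsBlocked && historySegmentAvailable
            ? RebalanceMode.HISTORICAL
            : RebalanceMode.FULL;
    }
}
```

With min = 512 and max = 1024, a size of 1024 blocks reservations, a later size of 700 keeps them blocked, and only a size of 512 or less allows historical rebalance again. The gap between the two limits is what keeps the node from flapping between the two modes.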