Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

Stanislav Lukyanov Thu, 06 May 2021 10:00:18 -0700

An interesting suggestion I heard today.

The minWalArchiveSize property might actually be minWalArchiveTimespan - i.e. 
be a number of seconds instead of a number of bytes!


I think this makes perfect sense from the user's point of view.
"I want to have WAL archive for at least N hours but I have a limit of M 
gigabytes to store it".

Do we have checkpoint timestamps stored anywhere? (In the cp start markers?)
Perhaps we can actually implement this?

Thanks,
Stan
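
A minimal sketch of the "at least N hours, at most M gigabytes" rule described above. Everything here is hypothetical and is not Ignite code: the Segment model, the per-segment archive timestamp, and the method name are made up for illustration. It also assumes that the byte limit wins when the two limits conflict, which the thread does not settle.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a time-based minimum with a size-based maximum; not Ignite code. */
public class WalArchiveRetentionSketch {
    /** One archived WAL segment (illustrative model, assumed to carry a timestamp). */
    public static final class Segment {
        final long index;
        final long sizeBytes;
        final long archivedAtMillis;

        public Segment(long index, long sizeBytes, long archivedAtMillis) {
            this.index = index;
            this.sizeBytes = sizeBytes;
            this.archivedAtMillis = archivedAtMillis;
        }
    }

    /**
     * Picks segments that may be deleted.
     *
     * @param segments Archived segments, ordered oldest first.
     * @param nowMillis Current time.
     * @param minTimespanMillis History we would like to keep (the "N hours").
     * @param maxSizeBytes Hard limit on the archive size (the "M gigabytes").
     * @return Indexes of the segments to delete, oldest first.
     */
    public static List<Long> segmentsToDelete(
        List<Segment> segments, long nowMillis, long minTimespanMillis, long maxSizeBytes
    ) {
        long total = 0;

        for (Segment s : segments)
            total += s.sizeBytes;

        List<Long> toDelete = new ArrayList<>();

        for (Segment s : segments) {
            boolean olderThanMin = nowMillis - s.archivedAtMillis > minTimespanMillis;
            boolean overMax = total > maxSizeBytes;

            // Segments are ordered by age, so the first segment that is
            // inside the timespan while the archive fits ends the scan.
            if (!olderThanMin && !overMax)
                break;

            toDelete.add(s.index);
            total -= s.sizeBytes;
        }

        return toDelete;
    }
}
```

With this rule the archive keeps the requested history whenever it fits under the byte limit, and gives up the oldest part of that history only when it does not.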


> On 6 May 2021, at 14:13, Stanislav Lukyanov <[email protected]> wrote:
> 
> +1 to cancel WAL reservation on reaching getMaxWalArchiveSize
> +1 to add a public property to replace 
> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
> 
> I don't like the name getWalArchiveSize - I think it's a bit confusing (is it 
> the current size? the minimal size? the target size?)
> I suggest naming the property getMinWalArchiveSize. I think that this is 
> exactly what it is - the minimal size of the archive that we want to have.
> The archive size at all times should be between min and max.
> If archive size is less than min or more than max then the system 
> functionality can degrade (e.g. historical rebalance may not work as 
> expected).
> I think these rules are intuitively understood from the "min" and "max" names.
> 
> Ilya's suggestion about throttling is great although I'd do this in a 
> different ticket.
> 
> Thanks,
> Stan
> 
>> On 5 May 2021, at 19:25, Maxim Muzafarov <[email protected]> wrote:
>> 
>> Hello, Kirill
>> 
>> +1 for this change, however, there are too many configuration settings
>> that exist for the user to configure Ignite cluster. It is better to
>> keep the options that we already have and fix the behaviour of the
>> rebalance process as you suggested.
>> 
>> On Tue, 4 May 2021 at 19:01, ткаленко кирилл <[email protected]> wrote:
>>> 
>>> Hi Ilya!
>>> 
>>> Then we could greatly reduce the user load on the cluster until the rebalance 
>>> is over, which can be critical for the user.
>>> 
>>> 04.05.2021, 18:43, "Ilya Kasnacheev" <[email protected]>:
>>>> Hello!
>>>> 
>>>> Maybe we can have a mechanism here similar (or equal) to checkpoint-based
>>>> write throttling?
>>>> 
>>>> So we will be throttling for both checkpoint page buffer and WAL limit.
>>>> 
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>> 
>>>> Tue, 4 May 2021 at 11:29, ткаленко кирилл <[email protected]>:
>>>> 
>>>>> Hello everybody!
>>>>> 
>>>>> At the moment, if there are partitions for the rebalance for which the
>>>>> historical rebalance will be used, then we reserve segments in the WAL
>>>>> archive (we do not allow cleaning the WAL archive) until the rebalance for
>>>>> all cache groups is over.
>>>>> 
>>>>> If a cluster is under load during the rebalance, WAL archive size may
>>>>> significantly exceed limits set in
>>>>> DataStorageConfiguration#getMaxWalArchiveSize until the process is
>>>>> complete. This may lead to user issues and nodes may crash with the "No
>>>>> space left on device" error.
>>>>> 
>>>>> We have a system property, IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>>>>> (0.5 by default), which sets the threshold (multiplied by
>>>>> getMaxWalArchiveSize) down to which the WAL archive will be cleared, i.e.
>>>>> it sets the size of the WAL archive that will always be kept on the node.
>>>>> I propose to replace this system property with
>>>>> DataStorageConfiguration#getWalArchiveSize, in bytes, with a default of
>>>>> (getMaxWalArchiveSize * 0.5), as it is now.
>>>>> 
>>>>> Main proposal:
>>>>> When DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel
>>>>> the reservations of WAL segments and do not grant new ones until we reach
>>>>> DataStorageConfiguration#getWalArchiveSize. In this case, if there is no
>>>>> segment for historical rebalance, we will automatically switch to full
>>>>> rebalance.
>
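
The main proposal quoted above amounts to a two-threshold (hysteresis) rule, which can be sketched roughly as follows. This is only an illustration under stated assumptions and is not Ignite code: the class, the method names, and the RebalanceMode enum are hypothetical, and "reaching" the lower threshold is read here as the archive size dropping to it or below.

```java
/** Hypothetical sketch of the proposed reservation rule; not Ignite code. */
public class WalReservationGateSketch {
    /** Rebalance kinds named in the thread. */
    public enum RebalanceMode { HISTORICAL, FULL }

    private final long minArchiveSize;
    private final long maxArchiveSize;

    /** Whether WAL segments may currently be reserved for historical rebalance. */
    private boolean reservationsAllowed = true;

    public WalReservationGateSketch(long minArchiveSize, long maxArchiveSize) {
        if (minArchiveSize < 0 || minArchiveSize > maxArchiveSize)
            throw new IllegalArgumentException("Expected 0 <= min <= max");

        this.minArchiveSize = minArchiveSize;
        this.maxArchiveSize = maxArchiveSize;
    }

    /** Called whenever the WAL archive size changes. */
    public synchronized void onArchiveSizeChanged(long archiveSize) {
        if (archiveSize >= maxArchiveSize)
            reservationsAllowed = false; // Max reached: cancel and refuse reservations.
        else if (archiveSize <= minArchiveSize)
            reservationsAllowed = true;  // Cleaned down to min: reservations may resume.
        // Between min and max the previous decision is kept.
    }

    /**
     * @param segmentsInArchive Whether the WAL segments needed for historical rebalance still exist.
     * @return Historical rebalance if a reservation is possible, full rebalance otherwise.
     */
    public synchronized RebalanceMode chooseMode(boolean segmentsInArchive) {
        return reservationsAllowed && segmentsInArchive ? RebalanceMode.HISTORICAL : RebalanceMode.FULL;
    }
}
```

Keeping the previous decision between the two thresholds avoids flapping between historical and full rebalance while the archive size hovers just below the maximum.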
