Re: Hard limit WAL archive size

2021-01-26 Thread ткаленко кирилл
Hi!
No, we basically have a problem with the growth of the WAL archive.


26.01.2021, 19:06, "Vishwas Bm" :
> Hi,
>
> Is this related to the issue seen in IGNITE-13912?
>
> I hit IGNITE-13912 when I was using the Ignite 2.9 release.
> I have yet to try my use case with the fix provided as part of IGNITE-13912.
>
> Regards,
> Vishwas
>


Re: Hard limit WAL archive size

2021-01-26 Thread Ivan Daschinsky
As for me, the correct approach is to trigger a checkpoint when we are too
close to the WAL archive size limit.
The main purpose of this mechanism is to provide durability, so we
should think not about failing the node or deleting data voluntarily,
but about preventing possible data loss.

Tue, 26 Jan 2021 at 19:13, Zhenya Stanilovsky :

>
>
> Hello!
> This is unclear to me: what you described does not explain why the node
> works improperly or why the FH could fail this node. Can you explain?
>



-- 
Sincerely yours, Ivan Daschinskiy


Re: Hard limit WAL archive size

2021-01-26 Thread Zhenya Stanilovsky


Hello!
This is unclear to me: what you described does not explain why the node
works improperly or why the FH could fail this node. Can you explain?
 

Re: Hard limit WAL archive size

2021-01-26 Thread Vishwas Bm
Hi,

Is this related to the issue seen in IGNITE-13912?

I hit IGNITE-13912 when I was using the Ignite 2.9 release.
I have yet to try my use case with the fix provided as part of IGNITE-13912.



Regards,
Vishwas



Hard limit WAL archive size

2021-01-26 Thread ткаленко кирилл
Hello, everyone!

Currently, the property DataStorageConfiguration#maxWalArchiveSize does not
work as users expect. We can easily go beyond this limit and overflow the disk,
which leads to errors and a crash of the node. I propose to fix this
behavior and not let the WAL archive overflow.

The suggestion is not to add a segment to the archive if doing so would exceed
DataStorageConfiguration#maxWalArchiveSize, and instead to wait until space
becomes available.
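
To make the proposed behavior concrete, here is a minimal sketch of a
size-capped archive that refuses or delays a segment instead of exceeding the
limit. It is only a model and not Ignite code: the class BoundedWalArchive and
all of its methods are hypothetical names, and sizes are plain byte counts.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of a size-capped WAL archive; not Ignite code. */
class BoundedWalArchive {
    private final long maxSizeBytes;
    private final Deque<Long> segmentSizes = new ArrayDeque<>();
    private long usedBytes;

    BoundedWalArchive(long maxSizeBytes) {
        this.maxSizeBytes = maxSizeBytes;
    }

    /** Non-blocking variant: archives the segment only if it fits under the limit. */
    synchronized boolean tryArchive(long segmentBytes) {
        if (usedBytes + segmentBytes > maxSizeBytes)
            return false;

        segmentSizes.addLast(segmentBytes);
        usedBytes += segmentBytes;

        return true;
    }

    /** Blocking variant: waits until removeOldest() frees enough space. */
    synchronized void archive(long segmentBytes) throws InterruptedException {
        while (usedBytes + segmentBytes > maxSizeBytes)
            wait();

        segmentSizes.addLast(segmentBytes);
        usedBytes += segmentBytes;
    }

    /** Drops the oldest archived segment and wakes up waiting archivers. */
    synchronized boolean removeOldest() {
        Long size = segmentSizes.pollFirst();

        if (size == null)
            return false;

        usedBytes -= size;
        notifyAll();

        return true;
    }

    synchronized long usedBytes() {
        return usedBytes;
    }
}
```

In this model archive() waits forever if nothing ever calls removeOldest(),
or if a single segment is larger than the limit. The first case is exactly
the waiting that the rest of this message is concerned with.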

As a result, we may get a deadlock:
Get checkpointReadLock -> write to WAL -> need to roll over WAL segment -> need
to clean WAL archive -> need to complete checkpoint (impossible because
checkpointReadLock is held).
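
The last step of this chain can be illustrated with a plain
java.util.concurrent read-write lock. This is a sketch under one loud
assumption: the checkpoint is modeled as needing the write side of the same
lock whose read side the writer holds. CheckpointLockCycle and its methods are
hypothetical names, not Ignite code.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical model of the lock cycle; not Ignite code. */
class CheckpointLockCycle {
    /** Assumption: a checkpoint can proceed only if it gets the write side of the lock. */
    static boolean checkpointCanProceed(ReentrantReadWriteLock lock) {
        boolean acquired = lock.writeLock().tryLock();

        if (acquired)
            lock.writeLock().unlock();

        return acquired;
    }

    /** Returns {canProceedWhileReadLockHeld, canProceedAfterRelease}. */
    static boolean[] demo() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // A writer takes the read lock and then waits for archive space.
        lock.readLock().lock();
        boolean whileHeld = checkpointCanProceed(lock);

        // Only after the writer leaves can the checkpoint take the write lock.
        lock.readLock().unlock();
        boolean afterRelease = checkpointCanProceed(lock);

        return new boolean[] {whileHeld, afterRelease};
    }
}
```

While the read lock is held, the checkpoint cannot proceed, so a writer that
waits for archive space under that lock waits for a cleanup that cannot
happen.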

To avoid such situations, I suggest adding a heuristic: do not grant
IgniteCacheDatabaseSharedManager#checkpointReadLock if only a few segments
(1 by default) are left.
However, this will not let us avoid archive overflow situations completely.
Therefore, I suggest failing the node via the failure handler (FH) when a
deadlock is detected, since the outcome would be the same if there were no
disk space left.
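
A rough sketch of the heuristic follows. It assumes that "segments left" means
how many more fixed-size segments fit under the archive limit, and that the
lock is refused when that number is at or below the threshold. Both readings
are assumptions, and CheckpointReadLockGate is a hypothetical class, not
Ignite code.

```java
/** Hypothetical gate in front of the checkpoint read lock; not Ignite code. */
class CheckpointReadLockGate {
    private final long maxArchiveBytes;
    private final long segmentBytes;
    private final int minFreeSegments;

    CheckpointReadLockGate(long maxArchiveBytes, long segmentBytes, int minFreeSegments) {
        this.maxArchiveBytes = maxArchiveBytes;
        this.segmentBytes = segmentBytes;
        this.minFreeSegments = minFreeSegments;
    }

    /** Number of whole segments that still fit under the archive limit. */
    long freeSegments(long archivedBytes) {
        return Math.max(0L, (maxArchiveBytes - archivedBytes) / segmentBytes);
    }

    /** Assumption: refuse the lock when free segments are at or below the threshold. */
    boolean mayGrantReadLock(long archivedBytes) {
        return freeSegments(archivedBytes) > minFreeSegments;
    }
}
```

For example, with a limit of 640 bytes, 64-byte segments and a threshold of 1,
the gate still grants the lock at 512 archived bytes (2 segments left) and
refuses it at 576 (1 segment left). A writer that already holds the lock can
still fill the remaining space, which is why the failure handler is proposed
as the last resort.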