Re: Data loss in an Ignite application

Aleksej Avrutin Fri, 23 Feb 2024 16:19:12 -0800

Stephen,

Thank you for the message. At last, I've found the root cause of the issue.
It was an application bug (as expected), but not an obvious one.
Out of despair I decided to check all the components of the application
including Ignite. The good thing is that now I have better knowledge of how
to troubleshoot issues like this.


My best,
Alex Avrutin


On Fri, Feb 23, 2024 at 10:38 AM Stephen Darlington <[email protected]>
wrote:

> Is there a pattern to the lost records? Is it old records? Records for a
> particular customer? Records stored on a specific node or partition?
>
> On Thu, 22 Feb 2024 at 21:14, Aleksej Avrutin <[email protected]>
> wrote:
>
>> Jeremy,
>>
>> Thank you for the response. I reviewed cache properties using GG Control
>> Center and there was nothing in the cache props that would lead me to the
>> conclusion that any expiry policy/TTL is set up for the cache. It wasn't
>> set on the operation level, either.
>>
>> I decided to delete the cache entirely and re-create it. Tomorrow I'll
>> check if it helps.
>>
>> My best,
>> Alex Avrutin
>>
>>
>> On Thu, Feb 22, 2024 at 3:56 AM Jeremy McMillan <
>> [email protected]> wrote:
>>
>>> First, logging should be configured to at least WARN level if not INFO.
>>>
>>> Ignite manages data internally at the page level. Errors about pages
>>> indicate very low-level Ignite problems. The next level up is
>>> partitions; errors involving partitions are mid-to-low-level problems.
>>> The next level up is caches; errors at the cache level are mid-to-high-level
>>> problems. Above caches are cache records, where errors in record
>>> handling sit at a high level of abstraction, and above those are client
>>> application operations.
>>>
>>> The lower the level of abstraction at which errors appear, the less
>>> likely operations in general are to succeed. Since the cache appears to
>>> operate mostly as expected, and there are no obvious errors in the Ignite
>>> logs, most likely there is some client-side logic which is deleting
>>> records, and Ignite does not consider this behavior to be an error.
>>>
>>> I would recommend fine-tuning log coverage of the cache delete methods.
>>> First identify whether the deletion is happening on a client connection
>>> thread pool or on a thread for server-initiated operations.
>>>
>>> My guess is that a client is connecting, getting a cache object, and
>>> then setting expiration on that cache connection so that all cache adds
>>> under that cache connection will have expiration applied to them.
>>>
>>>
>>> https://ignite.apache.org/docs/2.14.0/configuring-caches/expiry-policies#configuration
>>>
>>> "You can also change or set Expiry Policy for individual cache
>>> operations. This policy is used for each operation invoked on the returned
>>> cache instance."
>>>
>>>
>>> https://ignite.apache.org/releases/latest/dotnetdoc/api/Apache.Ignite.Core.Client.Cache.ICacheClient-2.html?q=withExpiryPolicy#Apache_Ignite_Core_Client_Cache_ICacheClient_2_WithExpiryPolicy_Apache_Ignite_Core_Cache_Expiry_IExpiryPolicy_
>>>
>>> On Wed, Feb 21, 2024, 19:17 Aleksej Avrutin <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> A couple of days ago I encountered a strange phenomenon in our
>>>> application based on Apache Ignite .Net 2.14 with persistence (3 nodes, 1
>>>> backup per cache).
>>>> Data in a cache started disappearing for seemingly no reason and the
>>>> number of records could be halved (220K to 108K) overnight. I spent a
>>>> couple of days trying to find a problem in the application, crunched
>>>> hundreds of megabytes of application logs but didn't manage to find a
>>>> reason to blame the application. Retention/TTL is not set for the cache. Apache
>>>> Ignite logs with the option -DIGNITE_QUIET=false also don't reveal any
>>>> anomalies (or I don't know what to look for). The data shares are expected
>>>> to be durable (based on Azure Disk) and we never had any issues with them.
>>>> RAM utilisation is normal and there's plenty of available RAM.
>>>> The Ignite cluster is hosted in a 3-node Kubernetes cluster on Azure.
>>>>
>>>> The question is: how would you recommend investigating issues like
>>>> this? What metrics and logs can I check? Is it possible to log and track
>>>> individual Remove() operations as well as SQL queries at Ignite engine
>>>> level?
>>>>
>>>> The application has been working on Ignite for years already and we
>>>> didn't encounter data loss on such a scale before. It's possible that the
>>>> app wasn't used as extensively before as it is now and the problem went
>>>> unnoticed.
>>>>
>>>> My best,
>>>> Alex Avrutin
>>>>
>>>
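
The per-operation expiry hypothesis raised in the thread can be illustrated with a small, self-contained Python sketch. This is a toy model, not the Ignite API: `ToyCache`, `with_expiry_policy`, and the fake clock are hypothetical names chosen only to mirror the documented behaviour that a policy "is used for each operation invoked on the returned cache instance". The thread does not say that this was the actual root cause; the sketch only shows why records written through such a decorated instance can vanish later without any error being logged.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class ToyCache:
    """Toy in-memory stand-in for a cache. NOT the Ignite API."""

    def __init__(
        self,
        clock: Callable[[], float] = time.monotonic,
        store: Optional[Dict[Any, Tuple[Any, Optional[float]]]] = None,
        ttl: Optional[float] = None,
    ) -> None:
        self._clock = clock
        # The store is shared between the base cache and every derived view.
        self._store = {} if store is None else store
        self._ttl = ttl  # None means entries written here never expire

    def with_expiry_policy(self, ttl_seconds: float) -> "ToyCache":
        """Return a view over the same data whose writes expire after ttl_seconds."""
        return ToyCache(self._clock, self._store, ttl_seconds)

    def put(self, key: Any, value: Any) -> None:
        deadline = None if self._ttl is None else self._clock() + self._ttl
        self._store[key] = (value, deadline)

    def get(self, key: Any) -> Any:
        entry = self._store.get(key)
        if entry is None:
            return None
        value, deadline = entry
        if deadline is not None and self._clock() >= deadline:
            del self._store[key]  # lazily drop the expired entry
            return None
        return value


# A fake clock makes the example deterministic.
now = [0.0]
cache = ToyCache(clock=lambda: now[0])
expiring = cache.with_expiry_policy(60.0)  # same data, but writes carry a TTL

cache.put("a", 1)     # written through the plain instance: no expiry
expiring.put("b", 2)  # written through the decorated instance: expires at t=60

now[0] = 61.0
print(cache.get("a"), cache.get("b"))  # prints "1 None"
```

In this model the cache itself has no TTL configured, which matches what an inspection of the cache properties would show, yet entries written through the decorated view still disappear. Checking every place where the application obtains a cache instance is therefore a reasonable step when investigating unexplained record loss.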
