First, logging should be configured at WARN level at least, and preferably INFO. Ignite manages data internally at the page level, so errors about pages indicate very low-level Ignite problems. The next level up is partitions; errors involving partitions are mid-to-low-level Ignite problems. The next level up is caches; errors at the cache level are mid-to-high-level problems. Above that are cache records; errors in cache record handling sit at a high level of abstraction. The highest level is client application operations.
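The layering above can be turned into a rough first pass over the logs. Here is a minimal sketch in Python, assuming plain-text log lines that contain a severity word such as WARN or ERROR. The keyword patterns and the names `classify` and `summarize` are my own illustrative assumptions, not Ignite's actual log format or tooling, so adjust them to what your logs really contain.

```python
import re
from collections import Counter

# Levels ordered from lowest to highest abstraction, as described above.
# The keyword patterns are illustrative assumptions, not Ignite's real log format.
LEVELS = [
    ("page", re.compile(r"\bpages?\b", re.IGNORECASE)),
    ("partition", re.compile(r"\bpartitions?\b", re.IGNORECASE)),
    ("cache", re.compile(r"\bcaches?\b", re.IGNORECASE)),
    ("record", re.compile(r"\b(records?|entry|entries|keys?)\b", re.IGNORECASE)),
]

# Only lines carrying one of these severity words are considered.
SEVERITY = re.compile(r"\b(WARN|WARNING|ERROR|SEVERE)\b")


def classify(line):
    """Return the lowest abstraction level mentioned in a WARN/ERROR line.

    Returns None for lines without a severity word, and "other" for
    WARN/ERROR lines that match none of the level keywords.
    """
    if not SEVERITY.search(line):
        return None
    for name, pattern in LEVELS:
        if pattern.search(line):
            return name
    return "other"


def summarize(lines):
    """Count WARN/ERROR lines per abstraction level."""
    return Counter(level for level in map(classify, lines) if level is not None)


if __name__ == "__main__":
    # Hypothetical log lines, made up for illustration only.
    sample = [
        "[12:00:01] ERROR Failed to read page from disk",
        "[12:00:02] WARN Partition 17 state mismatch in cache orders",
        "[12:00:03] INFO Cache orders started",
        "[12:00:04] ERROR Failed to update record for key 42",
    ]
    print(summarize(sample))
```

A line that mentions several levels is counted under the lowest one, on the reasoning that the lowest-level failure is the one most likely to explain the rest.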
The lower the level of abstraction at which errors appear, the less likely operations in general are to succeed. Since the cache appears to operate mostly as expected and there are no obvious errors in the Ignite logs, most likely some client-side logic is deleting records, and Ignite does not consider this behavior an error.

I would recommend fine-tuning the log coverage of the cache delete methods. First identify whether the deletion is happening on a client connection thread pool or on a thread for server-initiated operations. My guess is that a client is connecting, getting a cache object, and then setting an expiry policy on that cache instance, so that every entry added through that instance has expiration applied to it.

https://ignite.apache.org/docs/2.14.0/configuring-caches/expiry-policies#configuration
"You can also change or set Expiry Policy for individual cache operations. This policy is used for each operation invoked on the returned cache instance."

https://ignite.apache.org/releases/latest/dotnetdoc/api/Apache.Ignite.Core.Client.Cache.ICacheClient-2.html?q=withExpiryPolicy#Apache_Ignite_Core_Client_Cache_ICacheClient_2_WithExpiryPolicy_Apache_Ignite_Core_Cache_Expiry_IExpiryPolicy_

On Wed, Feb 21, 2024, 19:17 Aleksej Avrutin <[email protected]> wrote:

> Hello,
>
> A couple of days ago I encountered a strange phenomenon in our application
> based on Apache Ignite .Net 2.14 with persistence (3 nodes, 1 backup per
> cache).
> Data in a cache started disappearing for seemingly no reason and the
> amount of records could be halved (220K to 108K) overnight. I spent a
> couple of days trying to find a problem in the application, crunched
> hundreds megabytes of application logs but didn't manage to find a reason
> to blame the application. Retention/TTL is not set for the cache. Apache
> Ignite logs with the option -DIGNITE_QUIET=false also don't reveal any
> anomalies (or I don't know what to look for). The data shares are expected
> to be durable (based on Azure Disk) and we never had any issues with them.
> RAM utilisation is normal and there's plenty of available RAM.
> The Ignite cluster is hosted in a 3 node Kubernetes cluster on Azure.
>
> The question is: how would you recommend investigating issues like this?
> What metrics and logs can I check? Is it possible to log and track
> individual Remove() operations as well as SQL queries at Ignite engine
> level?
>
> The application has been working on Ignite for years already and we didn't
> encounter data loss at such scales before. It's possible that the app
> wasn't used so extensively before as it is now and the problem left
> unnoticed.
>
> My best,
> Alex Avrutin
>
