Re: Ignite Cluster Config Issue

2021-11-26 Thread Henrik Y

Thanks. I have run into the same issue. This helps me understand it.


On 2021/11/26 12:40, Gurmehar Kalra wrote:

The issue got resolved after adding the two lines of code below.

ignite.cluster().baselineAutoAdjustEnabled(true);

ignite.cluster().baselineAutoAdjustTimeout(1);
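
For context, a minimal sketch of where these calls fit. The config path
and the activation step are illustrative assumptions, not taken from the
thread; baselineAutoAdjustTimeout() takes milliseconds, so 1 means
"adjust the baseline almost immediately".

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cluster.ClusterState;

    public class BaselineAutoAdjustExample {
        public static void main(String[] args) {
            // Hypothetical config path - substitute your own.
            Ignite ignite = Ignition.start("ignite-config.xml");

            // With native persistence the cluster starts inactive;
            // activate it before changing baseline settings.
            ignite.cluster().state(ClusterState.ACTIVE);

            // Auto-adjust the baseline topology when nodes join or
            // leave, after the given timeout in milliseconds.
            ignite.cluster().baselineAutoAdjustEnabled(true);
            ignite.cluster().baselineAutoAdjustTimeout(1);
        }
    }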



Re: [2.11.0]: 'B+Tree is corrupted' exception in GridCacheTtlManager.expire() and PartitionsEvictManager$PartitionEvictionTask.run() on node start

2021-11-26 Thread Sergey Korotkov
Hello Denis,

* Yes, we run 2.11.0 against the old 2.8.1 PDS.
* And no, it's not necessarily a compatibility problem, since
corruption was detected both in a cache with old records and in one
without them. But yes, the caches were created by version 2.8.1.
* I also forgot to mention that we did yet another test before the long
run - stopping one Ignite node for about 3 hours to emulate a node
failure.  Note that the data corruption was detected afterwards on
another Ignite node.

In more detail, the following steps were done:

1. Set up a clean 2.8.1 cluster and load 18 days of historical data into
the 1st (of 3) caches we use with expiration.  The data was loaded in
such a way that the oldest records start to expire the next day.  The
two other caches remain empty.

2. Start the application and let it generate some data into the 1st
cache (ttl=18 days) and into the 2nd one (ttl=1 day).

3. The same day, stop the application, deactivate and stop the cluster,
install Ignite 2.11.0, and start the 2.11.0 cluster.

4. Start the application and invoke our custom warm-up procedure, which
reads all records from all caches to pull them into memory (a sketch of
such a warm-up follows these steps).  So at this point all data looked
good, at least for reading, I suppose.

5. Let the application work over a weekend under load.

6. After the weekend, without stopping the load from the application,
stop one of the Ignite nodes for about 3 hours.

7. Start the Ignite node.  No problems were detected in the application
behaviour.

8. Let the application work under load for 5 days.

9. Stop the application and perform the steps for the Ignite cluster
restart described in the original message (the deactivate/activate part
is sketched below as well).  One node crashed in the ttl thread after
the cluster was activated.
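
The custom warm-up procedure itself is not shown in this thread; a
minimal sketch of such a read-everything warm-up might look like the
following (class and method names are made up for illustration):

    import javax.cache.Cache;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.cache.query.QueryCursor;
    import org.apache.ignite.cache.query.ScanQuery;

    public final class CacheWarmUp {
        /** Reads every entry of every cache so its pages are pulled
         *  from the PDS into the configured data region. */
        public static void warmUp(Ignite ignite) {
            for (String cacheName : ignite.cacheNames()) {
                IgniteCache<Object, Object> cache =
                    ignite.cache(cacheName).withKeepBinary();

                long cnt = 0;

                try (QueryCursor<Cache.Entry<Object, Object>> cur =
                         cache.query(new ScanQuery<>())) {
                    for (Cache.Entry<Object, Object> ignored : cur)
                        cnt++; // touching the entry forces a page read
                }

                System.out.printf("Warmed up %s: %d entries%n",
                    cacheName, cnt);
            }
        }
    }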
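
The full restart procedure is in the original message; the
deactivation/activation part of it boils down to cluster state changes
like these (a sketch only, run e.g. from a client node; the ClusterState
API is available since Ignite 2.9):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.cluster.ClusterState;

    public class ClusterRestartSteps {
        /** Deactivate before stopping nodes so checkpoints finish
         *  cleanly. */
        public static void deactivate(Ignite ignite) {
            ignite.cluster().state(ClusterState.INACTIVE);
        }

        /** Activate again once all baseline nodes are back after the
         *  upgrade. */
        public static void activate(Ignite ignite) {
            ignite.cluster().state(ClusterState.ACTIVE);
        }
    }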

Hope this helps or leads to some other thoughts.

Thanks,

--

  Sergey


On 24.11.2021 2:10, Denis Chudov wrote:
> Hi Sergey!
>
> Thank you for providing details.
> Have I understood correctly that you run a newer version of Ignite on
> older persistent files? Is there any possibility that some data in
> your caches survived those 5 days of cluster work?
> I'm just trying to exclude any compatibility problems,
> like https://issues.apache.org/jira/browse/IGNITE-14252
>
> Denis Chudov
> Software Engineer, Moscow
> +7 905 5775239
> https://www.gridgain.com 
> Powered by Apache® Ignite™
>
>
> On Tue, Nov 23, 2021 at 8:32 AM Sergey Korotkov
> <serge.korot...@gmail.com> wrote:
>
> Hello Denis,
>
> Yes, as I said in the original message we do use the expiration on
> persistent caches.
>
> The corruptedPages_2021-11-08_11-03-21_999.txt and
> corruptedPages_2021-11-09_12-43-12_449.txt files were generated by
> Ignite on crash. They show that two different caches were
> affected. The first one during the expiration and the second (next
> day) during rebalance eviction.  Both caches are persistent and
> use the expiration.
>
> I also ran the diagnostic utility (IgniteWalConverter) the way it
> is recommended in the error message (output attached as diag-*
> files).
>
> Is there any useful information in these diag-* files which can
> help to understand what was corrupted and how, in particular?
>
> ***
>
> Generally this was a test run of the new 2.11.0 version in a test
> environment. The goal was to check whether the new version works
> fine with our application and also can be safely stopped/started
> for maintenance. We did that because we ran into a similar problem
> with 'B+Tree is corrupted' on production during rebalance eviction
> (with 2.8.1).  We saw two similar issues fixed in 2.11.0:
> https://issues.apache.org/jira/browse/IGNITE-12489 and
> https://issues.apache.org/jira/browse/IGNITE-14093, and considered
> upgrading in case it would help.  By the way, the fix for
> IGNITE-12489 (https://github.com/apache/ignite/pull/8358/commits)
> contains a lot of changes made over several attempts. Maybe it just
> does not fix all situations?
>
> Before the deactivation the cluster worked fine under our usual
> load for about 5 days. The load is about 300 requests per second,
> each consisting of several reads and a single write to caches with
> the expiration turned on.  After that we stopped / started the
> cluster to emulate the situation we had on production with 2.8.1
> (the load from our application was stopped as well before the
> deactivation request).
>
> ***
>
> Cache configurations. The first one has an affinity key and an
> interceptor:
>
>   public CacheConfiguration getEdenContactHistoryCacheConfiguration() {
>     CacheConfiguration cacheConfiguration = new CacheConfiguration<>();
>     cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
>     cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false, 1024));
>     cacheConfiguration.setBackups(1);
>     

Re: [2.11.0]: 'B+Tree is corrupted' exception in GridCacheTtlManager.expire() and PartitionsEvictManager$PartitionEvictionTask.run() on node start

2021-11-26 Thread Sergey Korotkov
Hello, Zhenya

On 26.11.2021 12:32, Zhenya Stanilovsky wrote:
> probably this is the
> case: https://issues.apache.org/jira/browse/IGNITE-15990

Thanks for pointing it out!  I also noticed this one.

I investigated more and suspect that it can be the same issue (concurrent
inserts and deletes on ttl) but at another level.  The issue above is
about a corrupted B+ Tree structure as such.

But in our case the problem is in the data page itself.  The call stack
and error message (Item not found: 3) show that the data page doesn't
contain the "indirect" item with id = 3 in the items array.  Maybe it is
really missing, or maybe the invariant that the indirect items are
stored in sorted order is broken, so the binary search in
AbstractDataPageIO.findIndirectItemIndex() fails.
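
To illustrate the suspicion (a toy model, not Ignite's actual page
code): the items array is expected to be sorted by item id and the
lookup is a plain binary search, so a broken sort order makes a
physically present item unreachable.

    import java.util.Arrays;

    public class IndirectItemLookup {
        /** Toy version of the lookup: binary search over item ids. */
        static int findIndirectItemIndex(int[] itemIds, int itemId) {
            int idx = Arrays.binarySearch(itemIds, itemId);

            if (idx < 0)
                throw new IllegalStateException("Item not found: " + itemId);

            return idx;
        }

        public static void main(String[] args) {
            // Sorted as the invariant requires: id 3 is found.
            findIndirectItemIndex(new int[] {1, 2, 3, 4}, 3);

            // Invariant broken: 3 is present but out of order, so the
            // binary search misses it - "Item not found: 3".
            findIndirectItemIndex(new int[] {1, 4, 2, 3}, 3);
        }
    }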

Thanks,

--

  Sergey

>
> Hello Denis,
>
> Yes, as I said in the original message we do use the expiration on
> persistent caches.
>
> The corruptedPages_2021-11-08_11-03-21_999.txt and
> corruptedPages_2021-11-09_12-43-12_449.txt files were generated by
> Ignite on crash. They show that two different caches were
> affected. The first one during the expiration and the second (next
> day) during rebalance eviction.  Both caches are persistent and
> use the expiration.
>
> I also ran the diagnostic utility (IgniteWalConverter) the way it
> is recommended in the error message (output attached as diag-*
> files).
>
> Is there any useful information in these diag-* files which can
> help to understand what was corrupted and how, in particular?
>
> ***
>
> Generally this was a test run of the new 2.11.0 version in a test
> environment. The goal was to check whether the new version works
> fine with our application and also can be safely stopped/started
> for maintenance. We did that because we ran into a similar problem
> with 'B+Tree is corrupted' on production during rebalance eviction
> (with 2.8.1).  We saw two similar issues fixed in 2.11.0:
> https://issues.apache.org/jira/browse/IGNITE-12489 and
> https://issues.apache.org/jira/browse/IGNITE-14093, and considered
> upgrading in case it would help.  By the way, the fix for
> IGNITE-12489 (https://github.com/apache/ignite/pull/8358/commits)
> contains a lot of changes made over several attempts. Maybe it just
> does not fix all situations?
>
> Before the deactivation the cluster worked fine under our usual
> load for about 5 days. The load is about 300 requests per second,
> each consisting of several reads and a single write to caches with
> the expiration turned on.  After that we stopped / started the
> cluster to emulate the situation we had on production with 2.8.1
> (the load from our application was stopped as well before the
> deactivation request).
>
> ***
>
> Cache configurations. The first one has an affinity key and an
> interceptor:
>
>   public CacheConfiguration getEdenContactHistoryCacheConfiguration() {
>     CacheConfiguration cacheConfiguration = new CacheConfiguration<>();
>     cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
>     cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false, 1024));
>     cacheConfiguration.setBackups(1);
>     cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC);
>     int expirationDays = appConfig.getContactHistoryEdenExpirationDays();
>     cacheConfiguration.setExpiryPolicyFactory(
>         CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.DAYS, expirationDays)));
>     cacheConfiguration.setInterceptor(new ContactHistoryInterceptor());
>     return cacheConfiguration;
>   }
>
> public class ContactHistoryKey {
>   String sOfferId;
>
>   @AffinityKeyMapped
>   String subsIdAffinityKey;
> }
>
>   CacheConfiguration getChannelOfferIdCache() {
>     CacheConfiguration cacheConfiguration = new CacheConfiguration<>();
>     cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
>     cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false, 1024));
>     cacheConfiguration.setBackups(1);
>     cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC);
>     int expirationDays = appConfig.getChannelOfferIdCacheExpirationDays();
>     cacheConfiguration.setExpiryPolicyFactory(
>         CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.DAYS, expirationDays)));
>     return cacheConfiguration;
>   }
>
> ***
>
> As for the other details, not sure if they are relevant or not.
> Deactivation was relatively long and the log contains a lot of
> warnings between 2021-11-08 10:54:44 and 2021-11-08 10:59:33.
> Also there was a page locks dump at 10:56:47,567.  A lot of locks
>