gswcomputing opened a new issue, #12623:
URL: https://github.com/apache/ignite/issues/12623
Hello, I'm running Apache Ignite 2.16.0/2.17.0 in a production environment
with a 15-node server cluster.
A deadlock occurred while one of the nodes (ip1 below) was executing
`org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy#query(org.apache.ignite.cache.query.SqlFieldsQuery)`.
The thread stack is as follows:
"xxx" Id=317 TIMED_WAITING on
java.util.concurrent.CountDownLatch$Sync@9342695
at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.CountDownLatch$Sync@9342695
at
[email protected]/java.util.concurrent.locks.LockSupport.parkNanos(Unknown
Source)
at
[email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown
Source)
at
[email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(Unknown
Source)
at [email protected]/java.util.concurrent.CountDownLatch.await(Unknown
Source)
at
org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:8228)
at
org.apache.ignite.internal.processors.query.h2.twostep.ReduceQueryRun.tryMapToSources(ReduceQueryRun.java:218)
at
org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1065)
at
org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:448)
at
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$5.iterator(IgniteH2Indexing.java:1447)
at
org.apache.ignite.internal.processors.cache.QueryCursorImpl.iter(QueryCursorImpl.java:102)
at
org.apache.ignite.internal.processors.query.h2.RegisteredQueryCursor.iter(RegisteredQueryCursor.java:91)
at
org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:92)
From the logs we found that one of the cluster nodes had restarted while the
query was executing:
reboot   system boot  5.10.0-136.12.0. Mon Mar 4 19:51 - 15:10 (3+19:19)
At that point, checking the latest baseline topology showed that it contained
only the stuck node's own IP:
globalState=DiscoveryDataClusterState [state=ACTIVE,
lastStateChangeTime=xxx, baselineTopology=BaselineTopology [id=0,
branchingHash=-708844738, branchingType='New BaselineTopology',
baselineNodes=[ip1:port1]]
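Since persistence is disabled, I would expect the baseline to follow topology
changes automatically. For reference, this is the baseline auto-adjust setting
I am considering; a minimal sketch assuming an `ignite` instance, and I have
not verified that it changes the outcome here:
// Let the baseline follow topology changes automatically, so a restarted
// node rejoins the baseline without manual intervention.
ignite.cluster().baselineAutoAdjustEnabled(true);
ignite.cluster().baselineAutoAdjustTimeout(30_000); // placeholder: adjust 30 s after a topology change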
My Ignite configuration is as follows:
IgniteConfiguration igniteCfg = new IgniteConfiguration();

TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
ipFinder.setAddresses(addressList); // addressList holds the IPs of the 15 server nodes
ipFinder.setShared(false);

TcpDiscoverySpi spi = new TcpDiscoverySpi();
spi.setIpFinder(ipFinder);

// Pure in-memory: native persistence disabled on the default data region.
DataRegionConfiguration dataRegionCfg = new DataRegionConfiguration();
dataRegionCfg.setPersistenceEnabled(false);

DataStorageConfiguration storageCfg = new DataStorageConfiguration();
storageCfg.setDefaultDataRegionConfiguration(dataRegionCfg);

igniteCfg.setDiscoverySpi(spi).setDataStorageConfiguration(storageCfg);

CacheConfiguration<Integer, AlarmRecord> cacheCfg = new CacheConfiguration<>(cacheName);
cacheCfg.setCacheMode(CacheMode.PARTITIONED)
    .setBackups(0) // no backups: partitions on a failed node are lost
    .setIndexedTypes(Integer.class, AlarmRecord.class)
    .setSqlFunctionClasses(ExtIgniteFunctions.class)
    .setRebalanceDelay(-1)
    .setOnheapCacheEnabled(false)
    .setSqlOnheapCacheEnabled(false)
    .setQueryParallelism(2)
    .setRebalanceMode(CacheRebalanceMode.NONE) // rebalancing fully disabled
    .setAffinity(affFunc);
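On top of this, I am considering adding an explicit failure detection timeout
and a cluster-wide default SQL query timeout, which as far as I understand
should bound waits like the one above. A sketch; the timeout values are
placeholders I picked, not recommendations:
// Placeholders: tune both timeouts to the cluster's actual latency profile.
igniteCfg.setFailureDetectionTimeout(10_000); // suspect unresponsive nodes after 10 s

igniteCfg.setSqlConfiguration(new SqlConfiguration()
    .setDefaultQueryTimeout(30_000)); // default timeout (ms) applied to SQL queries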
Finally, I would appreciate guidance on:
- the recommended production configuration;
- any known limitations or best practices for keeping the cluster stable and
avoiding full outages;
- how to configure the cluster so that queries already in flight when some
nodes restart do not get stuck as described above (see the sketch below).
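For the last point, would a per-query timeout be the recommended guard? A
minimal sketch of what I have in mind, assuming an `ignite` instance and my
`cacheName`; the SQL text and the `process` helper are made up for
illustration:
IgniteCache<Integer, AlarmRecord> cache = ignite.cache(cacheName);

SqlFieldsQuery qry = new SqlFieldsQuery("SELECT _key FROM AlarmRecord") // illustrative SQL
    .setTimeout(30, TimeUnit.SECONDS); // cancel the query instead of waiting indefinitely

try (QueryCursor<List<?>> cur = cache.query(qry)) {
    for (List<?> row : cur)
        process(row); // hypothetical consumer of each result row
}
catch (CacheException e) {
    // As far as I understand, a timed-out query should surface here,
    // wrapping a QueryCancelledException.
}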
Thank you for your guidance.