[
https://issues.apache.org/jira/browse/GEODE-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kirk Lund updated GEODE-8357:
-----------------------------
Summary: Exhausting the high priority message thread pool can result in
deadlock (was: Exhausting the high priority message pool can result in
deadlock)
> Exhausting the high priority message thread pool can result in deadlock
> -----------------------------------------------------------------------
>
> Key: GEODE-8357
> URL: https://issues.apache.org/jira/browse/GEODE-8357
> Project: Geode
> Issue Type: Bug
> Components: messaging
> Affects Versions: 1.0.0-incubating, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0,
> 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.11.0, 1.12.0
> Reporter: Kirk Lund
> Assignee: Kirk Lund
> Priority: Major
> Labels: GeodeOperationAPI
>
> The system property "DistributionManager.MAX_THREADS" defaults to 100:
> {noformat}
> int MAX_THREADS = Integer.getInteger("DistributionManager.MAX_THREADS", 100);
> {noformat}
> The system property used to be defined in geode-core
> ClusterDistributionManager and has moved to geode-core OperationExecutors.
> The value caps the size of both the threadPool and the highPriorityPool in
> ClusterOperationExecutors:
> {noformat}
> threadPool = CoreLoggingExecutors.newThreadPoolWithFeedStatistics(
>     "Pooled Message Processor ",
>     thread -> stats.incProcessingThreadStarts(), this::doProcessingThread,
>     MAX_THREADS, stats.getNormalPoolHelper(), threadMonitor,
>     INCOMING_QUEUE_LIMIT, stats.getOverflowQueueHelper());
> highPriorityPool = CoreLoggingExecutors.newThreadPoolWithFeedStatistics(
>     "Pooled High Priority Message Processor ",
>     thread -> stats.incHighPriorityThreadStarts(), this::doHighPriorityThread,
>     MAX_THREADS, stats.getHighPriorityPoolHelper(), threadMonitor,
>     INCOMING_QUEUE_LIMIT, stats.getHighPriorityQueueHelper());
> {noformat}
> I have seen server startup hang when recovering lots of expired entries from
> disk while using PDX. The hang looks like a dlock request for the PDX lock is
> not receiving a response. Checking the value for the
> distributionStats#highPriorityQueueSize statistic (in VSD) shows the value
> maxed out and never dropping.
> The dlock response granting the PDX lock is stuck in the highPriorityQueue
> because no high priority pool threads are available to process it. Stack
> dumps show that every high priority thread is running a task, such as
> recovering a bucket from disk, that is blocked waiting for the PDX lock.
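
The hang described above can be reproduced in miniature with a plain java.util.concurrent pool. This is only a sketch under stated assumptions: it does not use Geode's CoreLoggingExecutors or the real dlock code, and the class name PoolExhaustionSketch, the method grantDelivered, and the latch standing in for the PDX lock are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not Geode code.
public class PoolExhaustionSketch {

  /**
   * Submits blockedTasks tasks that wait for a "grant", then queues the grant
   * on the same pool. Returns true if the grant ran within timeoutMillis,
   * false if it stayed stuck in the queue behind the blocked tasks.
   */
  static boolean grantDelivered(int poolSize, int blockedTasks, long timeoutMillis)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    CountDownLatch lockGranted = new CountDownLatch(1);
    CountDownLatch started = new CountDownLatch(Math.min(poolSize, blockedTasks));
    try {
      for (int i = 0; i < blockedTasks; i++) {
        pool.execute(() -> {
          started.countDown();
          try {
            lockGranted.await(); // stands in for waiting on the PDX lock
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      started.await(); // every task that can get a thread is now blocked
      // The "grant" message lands on the same pool as the waiters.
      pool.execute(lockGranted::countDown);
      return lockGranted.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } finally {
      pool.shutdownNow(); // interrupt any workers that are still blocked
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("pool=4, blocked=4: delivered=" + grantDelivered(4, 4, 500));
    System.out.println("pool=4, blocked=3: delivered=" + grantDelivered(4, 3, 500));
  }
}
```

With every thread occupied by a waiter, the grant never runs; leaving a single thread free is enough for it to be delivered.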
> Several changes could improve this situation, either in conjunction or
> individually:
> # improve observability to enable support to identify that this situation
> has occurred
> # automatically identify this situation and warn the user with a log statement
> # automatically prevent this situation
> # identify the messages that are prone to causing deadlocks and move them to
> a dedicated thread pool with a higher limit
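
One possible shape for the first two items (observability and a warning) is a periodic check of the pool's own counters. The sketch below assumes the pool is a plain java.util.concurrent.ThreadPoolExecutor, which may not match what CoreLoggingExecutors actually returns; the names PoolSaturationCheck, looksStuck, and demo are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not Geode code.
public class PoolSaturationCheck {

  /**
   * True when every worker is busy, tasks are waiting in the queue, and no
   * task has completed since the previous sample. getActiveCount() is only
   * approximate, so treat the result as a hint rather than proof.
   */
  static boolean looksStuck(ThreadPoolExecutor pool, long completedAtLastSample) {
    boolean allBusy = pool.getActiveCount() >= pool.getMaximumPoolSize();
    boolean backlog = !pool.getQueue().isEmpty();
    boolean noProgress = pool.getCompletedTaskCount() == completedAtLastSample;
    return allBusy && backlog && noProgress;
  }

  /** Builds a saturated two-thread pool, samples it twice, and reports the check. */
  static boolean demo() throws InterruptedException {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    CountDownLatch release = new CountDownLatch(1);
    CountDownLatch started = new CountDownLatch(2);
    try {
      for (int i = 0; i < 2; i++) {
        pool.execute(() -> {
          started.countDown();
          try {
            release.await(); // simulate a task blocked on a lock
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      started.await();
      pool.execute(() -> { }); // queued behind the two blocked tasks
      long completed = pool.getCompletedTaskCount();
      Thread.sleep(100); // sampling interval; a real monitor would use seconds
      return looksStuck(pool, completed);
    } finally {
      release.countDown();
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    if (demo()) {
      System.out.println("WARN: pool saturated with queued work and no progress");
    }
  }
}
```

A monitor built this way would only flag a candidate deadlock; confirming one would still require thread dumps like those described in the issue.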