Kirk Lund created GEODE-8357:
--------------------------------

             Summary: Exhausting the high priority message pool can result in 
deadlock
                 Key: GEODE-8357
                 URL: https://issues.apache.org/jira/browse/GEODE-8357
             Project: Geode
          Issue Type: Bug
          Components: messaging
            Reporter: Kirk Lund


The system property "DistributionManager.MAX_THREADS" defaults to 100:
{noformat}
int MAX_THREADS = Integer.getInteger("DistributionManager.MAX_THREADS", 100);
{noformat}
The system property used to be defined in geode-core ClusterDistributionManager 
and has moved to geode-core OperationExecutors.
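
As a minimal sketch of how such a property is resolved (plain JDK behavior, not Geode code): Integer.getInteger returns the default unless the system property is set to a parseable integer. The class name MaxThreadsProperty and the value 200 are hypothetical. In Geode the value is read into a constant, so an override presumably has to be in place before that class is initialized, for example via -DDistributionManager.MAX_THREADS=200 on the JVM command line.

```java
public class MaxThreadsProperty {

  // Same property name as in the issue; the helper itself is illustrative only.
  static final String PROPERTY = "DistributionManager.MAX_THREADS";

  /** Resolves the property the same way the quoted line does, falling back to 100. */
  public static int resolve() {
    return Integer.getInteger(PROPERTY, 100);
  }

  public static void main(String[] args) {
    System.out.println("default:  " + resolve());
    System.setProperty(PROPERTY, "200"); // equivalent to -DDistributionManager.MAX_THREADS=200
    System.out.println("override: " + resolve());
    System.clearProperty(PROPERTY);
  }
}
```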

The value is used to limit ClusterOperationExecutors threadPool and 
highPriorityPool:
{noformat}
threadPool =
    CoreLoggingExecutors.newThreadPoolWithFeedStatistics("Pooled Message Processor ",
        thread -> stats.incProcessingThreadStarts(), this::doProcessingThread,
        MAX_THREADS, stats.getNormalPoolHelper(), threadMonitor,
        INCOMING_QUEUE_LIMIT, stats.getOverflowQueueHelper());

highPriorityPool = CoreLoggingExecutors.newThreadPoolWithFeedStatistics(
    "Pooled High Priority Message Processor ",
    thread -> stats.incHighPriorityThreadStarts(), this::doHighPriorityThread,
    MAX_THREADS, stats.getHighPriorityPoolHelper(), threadMonitor,
    INCOMING_QUEUE_LIMIT, stats.getHighPriorityQueueHelper());
{noformat}
I have seen server startup hang when recovering lots of expired entries from 
disk while using PDX. The hang looks like a dlock request for the PDX lock is 
not receiving a response. Checking the value for the 
distributionStats#highPriorityQueueSize statistic (in VSD) shows the value 
maxed out and never dropping.

The dlock response granting the PDX lock is stuck in the highPriorityQueue 
because no highPriorityPool threads are available to process it. Stack dumps 
show that every high priority thread is running a task, such as recovering a 
bucket from disk, that is blocked waiting for the PDX lock.
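
The pattern described above can be reproduced outside Geode. The following is a simplified sketch using only java.util.concurrent, not Geode's executors: a CountDownLatch stands in for the PDX lock grant, and the class and method names are hypothetical. When the number of waiting tasks equals the pool size, the task that would deliver the grant sits in the queue and never runs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolExhaustionDemo {

  /**
   * Submits `waiters` tasks that block until a grant arrives, then submits the
   * task that delivers the grant to the same fixed-size pool. Returns true if
   * the grant was processed within timeoutMillis.
   */
  public static boolean grantProcessed(int poolSize, int waiters, long timeoutMillis) {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    CountDownLatch grant = new CountDownLatch(1); // stands in for the lock grant
    try {
      for (int i = 0; i < waiters; i++) {
        pool.execute(() -> {
          try {
            grant.await(); // like a task blocked waiting for the lock
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // The response that would unblock the waiters needs a thread from the same pool.
      pool.execute(grant::countDown);
      return grant.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      pool.shutdownNow(); // interrupts any waiters so the demo can exit
    }
  }

  public static void main(String[] args) {
    System.out.println("waiters < pool size: " + grantProcessed(4, 3, 500));
    System.out.println("waiters = pool size: " + grantProcessed(4, 4, 500));
  }
}
```

With one free thread the grant is processed; with every thread occupied by a waiter it is not, which matches the stuck highPriorityQueueSize statistic.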

Several changes could improve this situation, either in conjunction or 
separately:
# improve observability to enable support to identify that this situation has 
occurred
# automatically identify this situation and warn the user with a log statement
# automatically prevent this situation
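
One possible shape for the second item, sketched against a plain java.util.concurrent.ThreadPoolExecutor rather than Geode's actual pool classes: sample the pool twice and flag it when every thread is busy, work is queued, and no task completed in between. The class name, method names, and sampling interval are assumptions for illustration, and a real check would need to tolerate long-running but healthy tasks.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolStallDetector {

  /** Returns true if the pool is saturated, has a backlog, and made no progress. */
  public static boolean looksStalled(ThreadPoolExecutor pool, long intervalMillis) {
    long completedBefore = pool.getCompletedTaskCount();
    try {
      Thread.sleep(intervalMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    boolean saturated = pool.getActiveCount() >= pool.getMaximumPoolSize();
    boolean backlog = !pool.getQueue().isEmpty();
    boolean noProgress = pool.getCompletedTaskCount() == completedBefore;
    return saturated && backlog && noProgress;
  }

  /** Builds a small pool, blocks `blockedTasks` threads, queues the release, and checks it. */
  public static boolean demo(int poolSize, int blockedTasks) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < blockedTasks; i++) {
        pool.execute(() -> {
          try {
            release.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      pool.execute(release::countDown); // only runs if a thread is free
      return looksStalled(pool, 200);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    if (demo(2, 2)) {
      System.out.println("WARN: pool saturated with queued work and no progress");
    }
  }
}
```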



