Ankit Jain created LUCENE-10428:
-----------------------------------

    Summary: getMinCompetitiveScore method in MaxScoreSumPropagator fails to converge leading to busy threads in infinite loop
    Key: LUCENE-10428
    URL: https://issues.apache.org/jira/browse/LUCENE-10428
    Project: Lucene - Core
    Issue Type: Bug
    Components: core/query/scoring, core/search
    Reporter: Ankit Jain

Customers reported high CPU usage on an Elasticsearch cluster in production. We noticed that a few search requests had been stuck for a long time:

```
% curl -s localhost:9200/_cat/tasks?v
indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205 AmMLzDQ4RrOJievRDeGFZw:569204 direct 1645195007282 14:36:47 6.2h
indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075 emjWc5bUTG6lgnCGLulq-Q:502074 direct 1645195037259 14:37:17 6.2h
indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270 emjWc5bUTG6lgnCGLulq-Q:583269 direct 1645201316981 16:21:56 4.5h
```

Flame graphs indicated that most of the CPU time was going into the *getMinCompetitiveScore method in MaxScoreSumPropagator*. Live JVM debugging then showed that the org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method was being invoked around 4 million times per second.

The following parameter values were captured during live debugging:

```
minScoreSum = 3.5541441
minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = 3.554144322872162
returnObj scoreSumUpperBound = 3.5541444
Math.ulp(minScoreSum) = 2.3841858E-7
```
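
The captured values can be checked in isolation. The following is a minimal sketch, not Lucene code: the class name `UlpRoundingDemo` and its helper methods are hypothetical, and it does not reproduce the loop in getMinCompetitiveScore or explain why it fails to converge. It only shows that narrowing the reported double argument to a float lands one float step above minScoreSum, which is consistent with the reported return value of scoreSumUpperBound being greater than minScoreSum.

```java
public class UlpRoundingDemo {
    // Values copied from the live-debugging output in this report.
    static final float MIN_SCORE_SUM = 3.5541441f;
    static final double SUM_ARGUMENT = 3.554144322872162;

    /** Narrows a double to float; Java rounds to the nearest float value. */
    static float narrowed(double sum) {
        return (float) sum;
    }

    /** True when the narrowed sum is strictly greater than the float threshold. */
    static boolean exceedsAfterNarrowing(double sum, float threshold) {
        return narrowed(sum) > threshold;
    }

    public static void main(String[] args) {
        // Spacing between adjacent floats at the magnitude of minScoreSum.
        System.out.println("ulp(minScoreSum)     = " + Math.ulp(MIN_SCORE_SUM));
        // The next representable float above minScoreSum.
        System.out.println("nextUp(minScoreSum)  = " + Math.nextUp(MIN_SCORE_SUM));
        // The reported double argument after narrowing to float.
        System.out.println("(float) sum          = " + narrowed(SUM_ARGUMENT));
        // Whether the narrowed value still compares greater than minScoreSum.
        System.out.println("exceeds minScoreSum? = "
                + exceedsAfterNarrowing(SUM_ARGUMENT, MIN_SCORE_SUM));
    }
}
```

The double argument lies between minScoreSum and the next float above it, closer to the upper neighbour, so the narrowing conversion rounds it up by one ulp.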