Ankit Jain created LUCENE-10428:
-----------------------------------

             Summary: getMinCompetitiveScore method in MaxScoreSumPropagator 
fails to converge leading to busy threads in infinite loop
                 Key: LUCENE-10428
                 URL: https://issues.apache.org/jira/browse/LUCENE-10428
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/query/scoring, core/search
            Reporter: Ankit Jain


Customers reported high CPU usage on an Elasticsearch cluster in production. We 
noticed that a few search requests had been stuck for a long time:

```
% curl -s localhost:9200/_cat/tasks?v                               
indices:data/read/search[phase/query] AmMLzDQ4RrOJievRDeGFZw:569205 AmMLzDQ4RrOJievRDeGFZw:569204 direct 1645195007282 14:36:47 6.2h
indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:502075 emjWc5bUTG6lgnCGLulq-Q:502074 direct 1645195037259 14:37:17 6.2h
indices:data/read/search[phase/query] emjWc5bUTG6lgnCGLulq-Q:583270 emjWc5bUTG6lgnCGLulq-Q:583269 direct 1645201316981 16:21:56 4.5h
```

Flame graphs indicated that CPU time was mostly being spent in the 
*getMinCompetitiveScore method in MaxScoreSumPropagator*. Live JVM debugging 
showed that the 
org.apache.lucene.search.MaxScoreSumPropagator.scoreSumUpperBound method was 
being invoked around 4 million times per second.

Values of some parameters, obtained from live debugging:
```
minScoreSum = 3.5541441
minScore + sumOfOtherMaxScores (params[0] scoreSumUpperBound) = 3.554144322872162
returnObj scoreSumUpperBound = 3.5541444
Math.ulp(minScoreSum) = 2.3841858E-7
```
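
The float arithmetic behind these values can be checked in isolation. The snippet below is only a sketch and not the Lucene code: the class and method names (UlpCheck, narrowed) are made up for illustration, and it assumes that narrowing the double argument to a float behaves like a plain cast. Under that assumption, the double passed to scoreSumUpperBound should round to the float immediately above minScoreSum, which would be consistent with the returned 3.5541444 still being greater than minScoreSum.

```java
public class UlpCheck {
    // Values copied from the live-debugging output above.
    static final float MIN_SCORE_SUM = 3.5541441f;
    static final double UPPER_BOUND_INPUT = 3.554144322872162;

    // Assumption: the double is narrowed to float with a plain cast.
    // The real scoreSumUpperBound method may do more than this.
    static float narrowed() {
        return (float) UPPER_BOUND_INPUT;
    }

    public static void main(String[] args) {
        float ulp = Math.ulp(MIN_SCORE_SUM);
        float narrowed = narrowed();

        System.out.println("ulp(minScoreSum)      = " + ulp);
        System.out.println("narrowed upper bound  = " + narrowed);
        // Is the narrowed value still strictly greater than minScoreSum?
        System.out.println("exceeds minScoreSum   = " + (narrowed > MIN_SCORE_SUM));
        // Is it exactly the next representable float above minScoreSum?
        System.out.println("one float step above  = "
                + (narrowed == Math.nextUp(MIN_SCORE_SUM)));
    }
}
```

Running it with the values above is a quick way to confirm that the reported numbers are mutually consistent; it does not by itself explain why the loop fails to converge.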



