[jira] [Commented] (LUCENE-10542) FieldSource exists implementations can avoid value retrieval

ASF subversion and git services (Jira) Fri, 29 Apr 2022 19:50:05 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530312#comment-17530312
 ]


ASF subversion and git services commented on LUCENE-10542:
----------------------------------------------------------

Commit fba1a68b4518b799d6e4c8637448e032825549cb in lucene's branch 
refs/heads/branch_9x from Kevin Risden
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fba1a68b451 ]

LUCENE-10542: FieldSource exists implementations can avoid value retrieval 
(#847)



> FieldSource exists implementations can avoid value retrieval
> ------------------------------------------------------------
>
>                 Key: LUCENE-10542
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10542
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Kevin Risden
>            Assignee: Kevin Risden
>            Priority: Minor
>         Attachments: flamegraph_getValueForDoc.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> While looking at LUCENE-10534, I found that the *FieldSource exists() 
> implementations, as written since LUCENE-7407, can skip the value lookup 
> when they only need to check whether a value exists.
> Flamegraphs: the x axis is time spent as a percentage of the profiled time; 
> the y axis is the stack trace, with the first call at the bottom and the 
> last call at the top.
> Look only at the leftmost getValueForDoc highlight (it helps to enlarge the 
> image or download the original).
> !flamegraph_getValueForDoc.png|height=410,width=1000!
> LongFieldSource#exists spends most of its time in 
> LongFieldSource#getValueForDoc, which in turn spends its time primarily on 
> two things:
> * FilterNumericDocValues#longValue()
> * advance()
> This makes sense based on looking at the code (copied below to make it easier 
> to see at once) 
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72
> {code:java}
>       private long getValueForDoc(int doc) throws IOException {
>         if (doc < lastDocID) {
>           throw new IllegalArgumentException(
>               "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
>         }
>         lastDocID = doc;
>         int curDocID = arr.docID();
>         if (doc > curDocID) {
>           curDocID = arr.advance(doc);
>         }
>         if (doc == curDocID) {
>           return arr.longValue();
>         } else {
>           return 0;
>         }
>       }
> {code}
> LongFieldSource#exists does not care about the actual longValue, only that 
> a value was found when iterating through the doc values.
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95
> {code:java}
>       @Override
>       public boolean exists(int doc) throws IOException {
>         getValueForDoc(doc);
>         return arr.docID() == doc;
>       }
> {code}
> Putting this together: because exists calls getValueForDoc, ~50% of its 
> time goes to fetching a long value that exists never uses. We can save that 
> time by having exists ignore the value and simply return whether 
> doc == curDocID.
> This overhead is amplified in MaxFloatFunction (and other places) because 
> exists() is called repeatedly. The value is eventually needed by longVal(), 
> but if exists() is called, say, 3 times for every longVal(), a lot of time 
> goes to computing a value when only an existence check is needed.
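
The idea in the paragraph above can be sketched as follows. This is a minimal illustration, not the code from PR #847: `ExistsSketch`, `FakeDocValues`, `advanceTo`, and `valueReads` are hypothetical names, and `FakeDocValues` is only a stand-in for Lucene's doc-values iterator so the example runs without Lucene on the classpath. The sketch assumes the fix amounts to sharing the "advance the iterator" step between `exists` and `longVal`, and reading the value only in `longVal`.

```java
final class ExistsSketch {

  /** Hypothetical stand-in for a doc-values iterator (not Lucene's real class). */
  static final class FakeDocValues {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;    // sorted doc IDs that have a value
    private final long[] values; // values[i] belongs to docs[i]
    private int idx = -1;        // -1 means "not positioned yet"
    int valueReads = 0;          // counts longValue() calls, for the demo only

    FakeDocValues(int[] docs, long[] values) {
      this.docs = docs;
      this.values = values;
    }

    int docID() {
      if (idx < 0) return -1;
      return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    /** Moves to the first doc >= target; callers pass a target beyond docID(). */
    int advance(int target) {
      do {
        idx++;
      } while (idx < docs.length && docs[idx] < target);
      return docID();
    }

    long longValue() {
      valueReads++;
      return values[idx];
    }
  }

  /** Per-document accessor whose exists() never reads the value. */
  static final class LongValues {
    private final FakeDocValues arr;
    private int lastDocID = -1;

    LongValues(FakeDocValues arr) {
      this.arr = arr;
    }

    /** Positions the iterator on or after doc and returns its doc ID; reads no value. */
    private int advanceTo(int doc) {
      if (doc < lastDocID) {
        throw new IllegalArgumentException(
            "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
      }
      lastDocID = doc;
      int curDocID = arr.docID();
      if (doc > curDocID) {
        curDocID = arr.advance(doc);
      }
      return curDocID;
    }

    long longVal(int doc) {
      // Only this path pays for the value read.
      return advanceTo(doc) == doc ? arr.longValue() : 0;
    }

    boolean exists(int doc) {
      // Existence is just "did the iterator land on this doc".
      return advanceTo(doc) == doc;
    }
  }

  public static void main(String[] args) {
    FakeDocValues dv = new FakeDocValues(new int[] {1, 4}, new long[] {10L, 40L});
    LongValues vals = new LongValues(dv);
    // Three existence checks followed by one value read, as in the example above.
    boolean a = vals.exists(1);
    boolean b = vals.exists(1);
    boolean c = vals.exists(1);
    long v = vals.longVal(1);
    System.out.println(a + " " + b + " " + c + " value=" + v + " valueReads=" + dv.valueReads);
  }
}
```

With this shape, repeated `exists()` calls on the same document cost only the iterator positioning, and the value is read once, when `longVal()` actually needs it.
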
> I found the same pattern in DoubleFieldSource, EnumFieldSource, 
> FloatFieldSource, IntFieldSource, and LongFieldSource. I put together a 
> change showing what this would look like:
> ----
> Simple JMH performance tests comparing the original FloatFieldSource to the 
> new ones from PR #847.
>  
> | Benchmark                                                       | Mode  | Cnt | Score ± Error    | Units |
> |-----------------------------------------------------------------|-------|-----|------------------|-------|
> | MyBenchmark.testMaxFloatFunction                                | thrpt | 25  | 64.159  ±  2.031 | ops/s |
> | MyBenchmark.testNewMaxFloatFunction                             | thrpt | 25  | 94.997  ±  2.365 | ops/s |
> | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource             | thrpt | 25  | 123.191 ±  9.291 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource          | thrpt | 25  | 123.817 ±  6.191 | ops/s |
> | MyBenchmark.testMaxFloatFunctionRareField                       | thrpt | 25  | 244.921 ±  6.439 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionRareField                    | thrpt | 25  | 239.288 ±  5.136 | ops/s |
> | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField    | thrpt | 25  | 271.521 ±  3.870 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25  | 279.334 ± 10.511 | ops/s |
> Source: https://github.com/risdenk/lucene-jmh
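
One way to read the benchmark scores is as throughput relative to the unmodified baseline in each group. The sketch below does that arithmetic on the reported scores only; `ThroughputRatios` and `ratio` are hypothetical names, and the reported error margins are ignored, so small differences between rows should not be over-interpreted. On these numbers the rows using the new FloatFieldSource come out at roughly 1.9x the baseline for the common field and about 1.1x for the rare field.

```java
final class ThroughputRatios {

  /** Relative throughput of a benchmark score against a baseline score (both in ops/s). */
  static double ratio(double score, double baseline) {
    return score / baseline;
  }

  public static void main(String[] args) {
    // Scores copied from the benchmark table; the first entry of each group is the baseline.
    String[] commonNames = {
      "testMaxFloatFunction",
      "testNewMaxFloatFunction",
      "testMaxFloatFunctionNewFloatFieldSource",
      "testNewMaxFloatFunctionNewFloatFieldSource"
    };
    double[] commonScores = {64.159, 94.997, 123.191, 123.817};

    String[] rareNames = {
      "testMaxFloatFunctionRareField",
      "testNewMaxFloatFunctionRareField",
      "testMaxFloatFunctionNewFloatFieldSourceRareField",
      "testNewMaxFloatFunctionNewFloatFieldSourceRareField"
    };
    double[] rareScores = {244.921, 239.288, 271.521, 279.334};

    print(commonNames, commonScores);
    print(rareNames, rareScores);
  }

  private static void print(String[] names, double[] scores) {
    for (int i = 0; i < names.length; i++) {
      System.out.printf("%-55s %.2fx%n", names[i], ratio(scores[i], scores[0]));
    }
  }
}
```
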



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
