[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval

Kevin Risden (Jira) Wed, 27 Apr 2022 10:55:05 -0700


     [ 
https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kevin Risden updated LUCENE-10542:
----------------------------------
    Description: 
While looking at LUCENE-10534, found that *FieldSource exists implementation 
after LUCENE-7407 can avoid value lookup when just checking for exists.

Flamegraphs - x axis = time spent as a percentage of time being profiled, y 
axis = stack trace bottom being first call top being last call

Looking only at the left most getValueForDoc highlight only (and it helps to 
make it bigger or download the original)

!flamegraph_getValueForDoc.png|height=500,width=500!

LongFieldSource#exists spends MOST of its time doing a 
LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time 
doing two things primarily:
* FilterNumericDocValues#longValue()
* advance()

This makes sense based on looking at the code (copied below to make it easier 
to see at once) 
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72

{code:java}
      private long getValueForDoc(int doc) throws IOException {
        if (doc < lastDocID) {
          throw new IllegalArgumentException(
              "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
docID=" + doc);
        }
        lastDocID = doc;
        int curDocID = arr.docID();
        if (doc > curDocID) {
          curDocID = arr.advance(doc);
        }
        if (doc == curDocID) {
          return arr.longValue();
        } else {
          return 0;
        }
      }
{code}

LongFieldSource#exists - doesn't care about the actual longValue. Just that 
there was a value found when iterating through the doc values.
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95

{code:java}
      @Override
      public boolean exists(int doc) throws IOException {
        getValueForDoc(doc);
        return arr.docID() == doc;
      }
{code}

So putting this all together for exists calling getValueForDoc, we spent ~50% 
of the time trying to get the long value when we don't need it in exists. We 
can save that 50% of time making exists not care about the actual value and 
just return if doc == curDocID basically.

This 50% extra is exaggerated in MaxFloatFunction (and other places) since 
exists() is being called a bunch. Eventually the value will be needed from 
longVal(), but if we call exists() say 3 times for every longVal(), we are 
spending a lot of time computing the value when we only need to check for 
existence.

I found the same pattern in DoubleFieldSource, EnumFieldSource, 
FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change 
showing what this would look like:

----

Simple JMH performance tests comparing the original FloatFieldSource to the new 
ones from PR #847.
 
||Benchmark||Mode||Cnt||Score and Error||Units||
|MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 
8.229|ops/s|
|MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 
± 27.575|ops/s|

Source: https://github.com/risdenk/lucene-jmh

  was:
While looking at LUCENE-10534, found that *FieldSource exists implementation 
after LUCENE-7407 can avoid value lookup when just checking for exists.

Flamegraphs - x axis = time spent as a percentage of time being profiled, y 
axis = stack trace bottom being first call top being last call

Looking only at the left most getValueForDoc highlight only (and it helps to 
make it bigger or download the original)

!flamegraph_getValueForDoc.png|height=250,width=250!

LongFieldSource#exists spends MOST of its time doing a 
LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time 
doing two things primarily:
* FilterNumericDocValues#longValue()
* advance()

This makes sense based on looking at the code (copied below to make it easier 
to see at once) 
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72

{code:java}
      private long getValueForDoc(int doc) throws IOException {
        if (doc < lastDocID) {
          throw new IllegalArgumentException(
              "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
docID=" + doc);
        }
        lastDocID = doc;
        int curDocID = arr.docID();
        if (doc > curDocID) {
          curDocID = arr.advance(doc);
        }
        if (doc == curDocID) {
          return arr.longValue();
        } else {
          return 0;
        }
      }
{code}

LongFieldSource#exists - doesn't care about the actual longValue. Just that 
there was a value found when iterating through the doc values.
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95

{code:java}
      @Override
      public boolean exists(int doc) throws IOException {
        getValueForDoc(doc);
        return arr.docID() == doc;
      }
{code}

So putting this all together for exists calling getValueForDoc, we spent ~50% 
of the time trying to get the long value when we don't need it in exists. We 
can save that 50% of time making exists not care about the actual value and 
just return if doc == curDocID basically.

This 50% extra is exaggerated in MaxFloatFunction (and other places) since 
exists() is being called a bunch. Eventually the value will be needed from 
longVal(), but if we call exists() say 3 times for every longVal(), we are 
spending a lot of time computing the value when we only need to check for 
existence.

I found the same pattern in DoubleFieldSource, EnumFieldSource, 
FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change 
showing what this would look like:

----

Simple JMH performance tests comparing the original FloatFieldSource to the new 
ones from PR #847.
 
||Benchmark||Mode||Cnt||Score and Error||Units||
|MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 
8.229|ops/s|
|MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 
± 27.575|ops/s|

Source: https://github.com/risdenk/lucene-jmh


> FieldSource exists implementation can avoid value retrieval
> -----------------------------------------------------------
>
>                 Key: LUCENE-10542
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10542
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Kevin Risden
>            Assignee: Kevin Risden
>            Priority: Minor
>         Attachments: flamegraph_getValueForDoc.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> While looking at LUCENE-10534, found that *FieldSource exists implementation 
> after LUCENE-7407 can avoid value lookup when just checking for exists.
> Flamegraphs - x axis = time spent as a percentage of time being profiled, y 
> axis = stack trace bottom being first call top being last call
> Looking only at the left most getValueForDoc highlight only (and it helps to 
> make it bigger or download the original)
> !flamegraph_getValueForDoc.png|height=500,width=500!
> LongFieldSource#exists spends MOST of its time doing a 
> LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its 
> time doing two things primarily:
> * FilterNumericDocValues#longValue()
> * advance()
> This makes sense based on looking at the code (copied below to make it easier 
> to see at once) 
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72
> {code:java}
>       private long getValueForDoc(int doc) throws IOException {
>         if (doc < lastDocID) {
>           throw new IllegalArgumentException(
>               "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
> docID=" + doc);
>         }
>         lastDocID = doc;
>         int curDocID = arr.docID();
>         if (doc > curDocID) {
>           curDocID = arr.advance(doc);
>         }
>         if (doc == curDocID) {
>           return arr.longValue();
>         } else {
>           return 0;
>         }
>       }
> {code}
> LongFieldSource#exists - doesn't care about the actual longValue. Just that 
> there was a value found when iterating through the doc values.
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95
> {code:java}
>       @Override
>       public boolean exists(int doc) throws IOException {
>         getValueForDoc(doc);
>         return arr.docID() == doc;
>       }
> {code}
> So putting this all together for exists calling getValueForDoc, we spent ~50% 
> of the time trying to get the long value when we don't need it in exists. We 
> can save that 50% of time making exists not care about the actual value and 
> just return if doc == curDocID basically.
> This 50% extra is exaggerated in MaxFloatFunction (and other places) since 
> exists() is being called a bunch. Eventually the value will be needed from 
> longVal(), but if we call exists() say 3 times for every longVal(), we are 
> spending a lot of time computing the value when we only need to check for 
> existence.
> I found the same pattern in DoubleFieldSource, EnumFieldSource, 
> FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change 
> showing what this would look like:
> ----
> Simple JMH performance tests comparing the original FloatFieldSource to the 
> new ones from PR #847.
>  
> ||Benchmark||Mode||Cnt||Score and Error||Units||
> |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 
> 8.229|ops/s|
> |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997
>  ± 27.575|ops/s|
> Source: https://github.com/risdenk/lucene-jmh



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval

Reply via email to