Re: Subject: New branch and feature freeze for Lucene 9.4.0

Adrien Grand Tue, 20 Sep 2022 03:24:54 -0700

Hi Mike,

If you have not started a RC yet, I'd like to include some small fixes for
bugs that were recently introduced in Lucene:
 - https://github.com/apache/lucene/pull/11792
 - https://github.com/apache/lucene/pull/11794


On Tue, Sep 20, 2022 at 1:26 AM Julie Tibshirani <[email protected]>
wrote:

> Sorry for the confusion. To explain, I use a local ann-benchmarks set-up
> that makes use of KnnGraphTester. It is a bit hacky and I accidentally
> included the warm-ups in the final timings. So the change to warm-up
> explains why we saw different results in our tests. This is great
> motivation to solidify and publish my local ann-benchmarks set-up so that
> it's not so fragile!
>
> In summary, with your latest fix the recall and QPS look good to me -- I
> don't detect any regression between 9.3 and 9.4.
>
> Julie
>
> On Mon, Sep 19, 2022 at 3:45 PM Michael Sokolov <[email protected]>
> wrote:
>
>> I'm confused, since warming should not be counted in the timings. Are you
>> saying that the recall was affected??
>>
>> On Mon, Sep 19, 2022 at 6:12 PM Julie Tibshirani <[email protected]>
>> wrote:
>>
>>> Using the ann-benchmarks framework, I still saw a similar regression as
>>> Mayya between 9.3 and 9.4. I investigated and found it was due to
>>> "KnnGraphTester to use KnnVectorQuery" (
>>> https://github.com/apache/lucene/pull/796), specifically the change to
>>> the warm-up strategy. If I revert it, the results look exactly as expected.
>>>
>>> I guess we can keep an eye on the nightly benchmarks tomorrow to
>>> double-check there's no drop. It would also be nice to formalize the
>>> ann-benchmarks set-up and run it regularly (like we've discussed in
>>> https://github.com/apache/lucene/issues/10665).
>>>
>>> Julie
>>>
>>> On Mon, Sep 19, 2022 at 10:33 AM Michael Sokolov <[email protected]>
>>> wrote:
>>>
>>>> Thanks for your speedy testing! I am observing comparable latencies
>>>> *when the index geometry (ie number of segments)* is unchanged. Agree we
>>>> can leave this for a later day. I'll proceed to cut 9.4 artifacts
>>>>
>>>> On Mon, Sep 19, 2022 at 11:02 AM Mayya Sharipova
>>>> <[email protected]> wrote:
>>>>
>>>>> It would be great if you all are able to test again with
>>>>>> https://github.com/apache/lucene/pull/11781/ applied
>>>>>
>>>>>
>>>>>
>>>>> I ran  the ann benchmarks with this change, and was happy to confirm
>>>>> that in my test recall with this PR is the same as in 9.3 branch, although
>>>>> QPS is lower, but we can investigate QPSs later.
>>>>>
>>>>> glove-100-angular M:16 efConstruction:100
>>>>> 9.3 recall9.3 QPSthis PR recallthis PR QPS
>>>>> n_cands=10 0.620 2745.933 0.620 1675.500
>>>>> n_cands=20 0.680 2288.665 0.680 1512.744
>>>>> n_cands=40 0.746 1770.243 0.746 1040.240
>>>>> n_cands=80 0.809 1226.738 0.809 695.236
>>>>> n_cands=120 0.843 948.908 0.843 525.914
>>>>> n_cands=200 0.878 671.781 0.878 351.529
>>>>> n_cands=400 0.918 392.265 0.918 207.854
>>>>> n_cands=600 0.937 282.403 0.937 144.311
>>>>> n_cands=800 0.949 214.620 0.949 116.875
>>>>>
>>>>> On Sun, Sep 18, 2022 at 6:25 PM Michael Sokolov <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> OK, I think I was wrong about latency having increased due to a change
>>>>>> in KnnGraphTester -- I did some testing there and couldn't reproduce.
>>>>>> There does seem to be a slight vector search latency increase,
>>>>>> possibly noise, but maybe due to the branching introduced to check
>>>>>> whether to do byte vs float operations? It would be a little
>>>>>> surprising if that were the case given the small number of branchings
>>>>>> compared to the number of multiplies in dot-product though.
>>>>>>
>>>>>> On Sun, Sep 18, 2022 at 3:25 PM Michael Sokolov <[email protected]>
>>>>>> wrote:
>>>>>> >
>>>>>> > Thanks for the deep-dive Julie. I was able to reproduce the changing
>>>>>> > recall. I had introduced some bugs in the diversity checks (that may
>>>>>> > have partially canceled each other out? it's hard to understand what
>>>>>> > was happening in the buggy case) and posted a fix today
>>>>>> > https://github.com/apache/lucene/pull/11781.
>>>>>> >
>>>>>> > There are a couple of other outstanding issues I found while doing a
>>>>>> > bunch of git bisecting;
>>>>>> >
>>>>>> > I think we might have introduced a (test-only) performance
>>>>>> regression
>>>>>> > in KnnGraphTester
>>>>>> >
>>>>>> > We may still be over-allocating the size of NeighborArray, leading
>>>>>> to
>>>>>> > excessive segmentation? I wonder if we could avoid dynamic
>>>>>> > re-allocation there, and simply initialize every neighbor array to
>>>>>> > 2*M+1.
>>>>>> >
>>>>>> > While I don't think these are necessarily blockers, given that we
>>>>>> are
>>>>>> > releasing HNSW improvements, it seems like we should address these,
>>>>>> > especially as the build-graph-on-index is one of the things we are
>>>>>> > releasing, and it is (may be?) impacted. I will see if I can put up
>>>>>> a
>>>>>> > patch or two.
>>>>>> >
>>>>>> > It would be great if you all are able to test again with
>>>>>> > https://github.com/apache/lucene/pull/11781/ applied
>>>>>> >
>>>>>> > -Mike
>>>>>> >
>>>>>> > On Fri, Sep 16, 2022 at 11:07 AM Adrien Grand <[email protected]>
>>>>>> wrote:
>>>>>> > >
>>>>>> > > Thank you Mike, I just backported the change.
>>>>>> > >
>>>>>> > > On Thu, Sep 15, 2022 at 6:32 PM Michael Sokolov <
>>>>>> [email protected]> wrote:
>>>>>> > >>
>>>>>> > >> it looks like a small bug fix, we have had on main (and 9.x?)
>>>>>> for a
>>>>>> > >> while now and no test failures showed up, I guess. Should be OK
>>>>>> to
>>>>>> > >> port. I plan to cut artifacts this weekend, or Monday at the
>>>>>> latest,
>>>>>> > >> but if you can do the backport today or tomorrow, that's fine by
>>>>>> me.
>>>>>> > >>
>>>>>> > >> On Thu, Sep 15, 2022 at 10:55 AM Adrien Grand <[email protected]>
>>>>>> wrote:
>>>>>> > >> >
>>>>>> > >> > Mike, I'm tempted to backport
>>>>>> https://github.com/apache/lucene/pull/1068 to branch_9_4, which is a
>>>>>> bugfix that looks pretty safe to me. What do you think?
>>>>>> > >> >
>>>>>> > >> > On Tue, Sep 13, 2022 at 4:11 PM Mayya Sharipova <
>>>>>> [email protected]> wrote:
>>>>>> > >> >>
>>>>>> > >> >> Thanks for running more tests, Michael.
>>>>>> > >> >> It is encouraging that you saw a similar performance between
>>>>>> 9.3 and 9.4. I will also run more tests with different parameters.
>>>>>> > >> >>
>>>>>> > >> >> On Tue, Sep 13, 2022 at 9:30 AM Michael Sokolov <
>>>>>> [email protected]> wrote:
>>>>>> > >> >>>
>>>>>> > >> >>> As a follow-up, I ran a test using the same parameters as
>>>>>> above, only
>>>>>> > >> >>> changing M=200 to M=16. This did result in a single segment
>>>>>> in both
>>>>>> > >> >>> cases (9.3, 9.4) and the performance was pretty similar;
>>>>>> within noise
>>>>>> > >> >>> I think. The main difference I saw was that the 9.3 index
>>>>>> was written
>>>>>> > >> >>> using CFS:
>>>>>> > >> >>>
>>>>>> > >> >>> 9.4:
>>>>>> > >> >>> recall  latency nDoc    fanout  maxConn beamWidth
>>>>>>  visited index ms
>>>>>> > >> >>> 0.755    1.36   1000000 100     16      100     200
>>>>>>  891402  1.00
>>>>>> > >> >>>  post-filter
>>>>>> > >> >>> -rw-r--r-- 1 sokolovm amazon 382M Sep 13 13:06
>>>>>> > >> >>> _0_Lucene94HnswVectorsFormat_0.vec
>>>>>> > >> >>> -rw-r--r-- 1 sokolovm amazon 262K Sep 13 13:06
>>>>>> > >> >>> _0_Lucene94HnswVectorsFormat_0.vem
>>>>>> > >> >>> -rw-r--r-- 1 sokolovm amazon 131M Sep 13 13:06
>>>>>> > >> >>> _0_Lucene94HnswVectorsFormat_0.vex
>>>>>> > >> >>>
>>>>>> > >> >>> 9.3:
>>>>>> > >> >>> recall  latency nDoc    fanout  maxConn beamWidth
>>>>>>  visited index ms
>>>>>> > >> >>> 0.775    1.34   1000000 100     16      100     4033
>>>>>> 977043
>>>>>> > >> >>> rw-r--r-- 1 sokolovm amazon  297 Sep 13 13:26 _0.cfe
>>>>>> > >> >>> -rw-r--r-- 1 sokolovm amazon 516M Sep 13 13:26 _0.cfs
>>>>>> > >> >>> -rw-r--r-- 1 sokolovm amazon  340 Sep 13 13:26 _0.si
>>>>>> > >> >>>
>>>>>> > >> >>> On Tue, Sep 13, 2022 at 8:50 AM Michael Sokolov <
>>>>>> [email protected]> wrote:
>>>>>> > >> >>> >
>>>>>> > >> >>> > I ran another test. I thought I had increased the RAM
>>>>>> buffer size to
>>>>>> > >> >>> > 8G and heap to 16G. However I still see two segments in
>>>>>> the index that
>>>>>> > >> >>> > was created. And looking at the infostream I see:
>>>>>> > >> >>> >
>>>>>> > >> >>> > dir=MMapDirectory@
>>>>>> /local/home/sokolovm/workspace/knn-perf/glove-100-angular.hdf5-train-200-200.index
>>>>>> > >> >>> > lockFactory=org\
>>>>>> > >> >>> > .apache.lucene.store.NativeFSLockFactory@4466af20
>>>>>> > >> >>> > index=
>>>>>> > >> >>> > version=9.4.0
>>>>>> > >> >>> >
>>>>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>>> > >> >>> > ramBufferSizeMB=8000.0
>>>>>> > >> >>> > maxBufferedDocs=-1
>>>>>> > >> >>> > ...
>>>>>> > >> >>> > perThreadHardLimitMB=1945
>>>>>> > >> >>> > ...
>>>>>> > >> >>> > DWPT 0 [2022-09-13T02:42:53.329404950Z; main]: flush
>>>>>> postings as
>>>>>> > >> >>> > segment _6 numDocs=555373
>>>>>> > >> >>> > IW 0 [2022-09-13T02:42:53.330671171Z; main]: 0 msec to
>>>>>> write norms
>>>>>> > >> >>> > IW 0 [2022-09-13T02:42:53.331113184Z; main]: 0 msec to
>>>>>> write docValues
>>>>>> > >> >>> > IW 0 [2022-09-13T02:42:53.331320146Z; main]: 0 msec to
>>>>>> write points
>>>>>> > >> >>> > IW 0 [2022-09-13T02:42:56.424195657Z; main]: 3092 msec to
>>>>>> write vectors
>>>>>> > >> >>> > IW 0 [2022-09-13T02:42:56.429239944Z; main]: 4 msec to
>>>>>> finish stored fields
>>>>>> > >> >>> > IW 0 [2022-09-13T02:42:56.429593512Z; main]: 0 msec to
>>>>>> write postings
>>>>>> > >> >>> > and finish vectors
>>>>>> > >> >>> > IW 0 [2022-09-13T02:42:56.430309031Z; main]: 0 msec to
>>>>>> write fieldInfos
>>>>>> > >> >>> > DWPT 0 [2022-09-13T02:42:56.431721622Z; main]: new segment
>>>>>> has 0 deleted docs
>>>>>> > >> >>> > DWPT 0 [2022-09-13T02:42:56.431921144Z; main]: new segment
>>>>>> has 0
>>>>>> > >> >>> > soft-deleted docs
>>>>>> > >> >>> > DWPT 0 [2022-09-13T02:42:56.435738086Z; main]: new segment
>>>>>> has no
>>>>>> > >> >>> > vectors; no norms; no docValues; no prox; freqs
>>>>>> > >> >>> > DWPT 0 [2022-09-13T02:42:56.435952356Z; main]:
>>>>>> > >> >>> > flushedFiles=[_6_Lucene94HnswVectorsFormat_0.vec, _6.fdm,
>>>>>> _6.fdt, _6_\
>>>>>> > >> >>> > Lucene94HnswVectorsFormat_0.vem, _6.fnm, _6.fdx,
>>>>>> > >> >>> > _6_Lucene94HnswVectorsFormat_0.vex]
>>>>>> > >> >>> > DWPT 0 [2022-09-13T02:42:56.436121861Z; main]: flushed
>>>>>> codec=Lucene94
>>>>>> > >> >>> > DWPT 0 [2022-09-13T02:42:56.437691468Z; main]: flushed:
>>>>>> segment=_6
>>>>>> > >> >>> > ramUsed=1,945.002 MB newFlushedSize=1,065.701 MB \
>>>>>> > >> >>> > docs/MB=521.134
>>>>>> > >> >>> >
>>>>>> > >> >>> > so I think it's this perThreadHardLimit that is triggering
>>>>>> the
>>>>>> > >> >>> > flushes? TBH this isn't something I had seen before; but
>>>>>> the docs say:
>>>>>> > >> >>> >
>>>>>> > >> >>> >   /**
>>>>>> > >> >>> >    * Expert: Sets the maximum memory consumption per
>>>>>> thread triggering
>>>>>> > >> >>> > a forced flush if exceeded. A
>>>>>> > >> >>> >    * {@link DocumentsWriterPerThread} is forcefully
>>>>>> flushed once it
>>>>>> > >> >>> > exceeds this limit even if the
>>>>>> > >> >>> >    * {@link #getRAMBufferSizeMB()} has not been exceeded.
>>>>>> This is a
>>>>>> > >> >>> > safety limit to prevent a {@link
>>>>>> > >> >>> >    * DocumentsWriterPerThread} from address space
>>>>>> exhaustion due to
>>>>>> > >> >>> > its internal 32 bit signed
>>>>>> > >> >>> >    * integer based memory addressing. The given value must
>>>>>> be less
>>>>>> > >> >>> > that 2GB (2048MB)
>>>>>> > >> >>> >    *
>>>>>> > >> >>> >    * @see #DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB
>>>>>> > >> >>> >    */
>>>>>> > >> >>> >
>>>>>> > >> >>> > On Mon, Sep 12, 2022 at 6:28 PM Michael Sokolov <
>>>>>> [email protected]> wrote:
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > Hi Mayya, thanks for persisting - I think we need to
>>>>>> wrestle this to
>>>>>> > >> >>> > > the ground for sure. In the test I ran, RAM buffer was
>>>>>> the default
>>>>>> > >> >>> > > checked in, which is weirdly: 1994MB. I did not
>>>>>> specifically set heap
>>>>>> > >> >>> > > size. I used maxConn/M=200. I'll  try with larger buffer
>>>>>> to see if I
>>>>>> > >> >>> > > can get 9.4 to produce a single segment for the same
>>>>>> test settings. I
>>>>>> > >> >>> > > see you used a much smaller M (16), which should have
>>>>>> produced quite
>>>>>> > >> >>> > > small graphs, and I agree, should have been a single
>>>>>> segment. Were you
>>>>>> > >> >>> > > able to verify the number of segments?
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > Agree that decrease in recall is not expected when more
>>>>>> segments are produced.
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > On Mon, Sep 12, 2022 at 1:51 PM Mayya Sharipova
>>>>>> > >> >>> > > <[email protected]> wrote:
>>>>>> > >> >>> > > >
>>>>>> > >> >>> > > > Hello Michael,
>>>>>> > >> >>> > > > Thanks for checking.
>>>>>> > >> >>> > > > Sorry for bringing this up again.
>>>>>> > >> >>> > > > First of all, I am ok with proceeding with the Lucene
>>>>>> 9.4 release and leaving the performance investigations for later.
>>>>>> > >> >>> > > >
>>>>>> > >> >>> > > > I am interested in what's the maxConn/M value you used
>>>>>> for your tests? What was the heap memory and the size of the RAM buffer 
>>>>>> for
>>>>>> indexing?
>>>>>> > >> >>> > > > Usually, when we have multiple segments, recall should
>>>>>> increase, not decrease. But I agree that with multiple segments we can 
>>>>>> see
>>>>>> a big drop in QPS.
>>>>>> > >> >>> > > >
>>>>>> > >> >>> > > > Here is my investigation with detailed output of the
>>>>>> performance difference between 9.3 and 9.4 releases. In my tests I used a
>>>>>> large indexing buffer (2Gb) and large heap (5Gb) to end up with a single
>>>>>> segment for both 9.3 and 9.4 tests, but still see a big drop in QPS in 
>>>>>> 9.4.
>>>>>> > >> >>> > > >
>>>>>> > >> >>> > > > Thank you.
>>>>>> > >> >>> > > >
>>>>>> > >> >>> > > >
>>>>>> > >> >>> > > >
>>>>>> > >> >>> > > >
>>>>>> > >> >>> > > >
>>>>>> > >> >>> > > > On Fri, Sep 9, 2022 at 12:21 PM Alan Woodward <
>>>>>> [email protected]> wrote:
>>>>>> > >> >>> > > >>
>>>>>> > >> >>> > > >> Done.  Thanks!
>>>>>> > >> >>> > > >>
>>>>>> > >> >>> > > >> > On 9 Sep 2022, at 16:32, Michael Sokolov <
>>>>>> [email protected]> wrote:
>>>>>> > >> >>> > > >> >
>>>>>> > >> >>> > > >> > Hi Alan - I checked out the interval queries patch;
>>>>>> seems pretty safe,
>>>>>> > >> >>> > > >> > please go ahead and port to 9.4.  Thanks!
>>>>>> > >> >>> > > >> >
>>>>>> > >> >>> > > >> > Mike
>>>>>> > >> >>> > > >> >
>>>>>> > >> >>> > > >> > On Fri, Sep 9, 2022 at 10:41 AM Alan Woodward <
>>>>>> [email protected]> wrote:
>>>>>> > >> >>> > > >> >>
>>>>>> > >> >>> > > >> >> Hi Mike,
>>>>>> > >> >>> > > >> >>
>>>>>> > >> >>> > > >> >> I’ve opened
>>>>>> https://github.com/apache/lucene/pull/11760 as a small bug fix PR
>>>>>> for a problem with interval queries.  Am I OK to port this to the 9.4
>>>>>> branch?
>>>>>> > >> >>> > > >> >>
>>>>>> > >> >>> > > >> >> Thanks, Alan
>>>>>> > >> >>> > > >> >>
>>>>>> > >> >>> > > >> >> On 2 Sep 2022, at 20:42, Michael Sokolov <
>>>>>> [email protected]> wrote:
>>>>>> > >> >>> > > >> >>
>>>>>> > >> >>> > > >> >> NOTICE:
>>>>>> > >> >>> > > >> >>
>>>>>> > >> >>> > > >> >> Branch branch_9_4 has been cut and versions
>>>>>> updated to 9.5 on stable branch.
>>>>>> > >> >>> > > >> >>
>>>>>> > >> >>> > > >> >> Please observe the normal rules:
>>>>>> > >> >>> > > >> >>
>>>>>> > >> >>> > > >> >> * No new features may be committed to the branch.
>>>>>> > >> >>> > > >> >> * Documentation patches, build patches and serious
>>>>>> bug fixes may be
>>>>>> > >> >>> > > >> >> committed to the branch. However, you should
>>>>>> submit all patches you
>>>>>> > >> >>> > > >> >> want to commit to Jira first to give others the
>>>>>> chance to review
>>>>>> > >> >>> > > >> >> and possibly vote against the patch. Keep in mind
>>>>>> that it is our
>>>>>> > >> >>> > > >> >> main intention to keep the branch as stable as
>>>>>> possible.
>>>>>> > >> >>> > > >> >> * All patches that are intended for the branch
>>>>>> should first be committed
>>>>>> > >> >>> > > >> >> to the unstable branch, merged into the stable
>>>>>> branch, and then into
>>>>>> > >> >>> > > >> >> the current release branch.
>>>>>> > >> >>> > > >> >> * Normal unstable and stable branch development
>>>>>> may continue as usual.
>>>>>> > >> >>> > > >> >> However, if you plan to commit a big change to the
>>>>>> unstable branch
>>>>>> > >> >>> > > >> >> while the branch feature freeze is in effect,
>>>>>> think twice: can't the
>>>>>> > >> >>> > > >> >> addition wait a couple more days? Merges of bug
>>>>>> fixes into the branch
>>>>>> > >> >>> > > >> >> may become more difficult.
>>>>>> > >> >>> > > >> >> * Only Jira issues with Fix version 9.4 and
>>>>>> priority "Blocker" will delay
>>>>>> > >> >>> > > >> >> a release candidate build.
>>>>>> > >> >>> > > >> >>
>>>>>> > >> >>> > > >> >>
>>>>>> ---------------------------------------------------------------------
>>>>>> > >> >>> > > >> >> To unsubscribe, e-mail:
>>>>>> [email protected]
>>>>>> > >> >>> > > >> >> For additional commands, e-mail:
>>>>>> [email protected]
>>>>>> > >> >>> > > >> >>
>>>>>> > >> >>> > > >> >>
>>>>>> > >> >>> > > >> >
>>>>>> > >> >>> > > >> >
>>>>>> ---------------------------------------------------------------------
>>>>>> > >> >>> > > >> > To unsubscribe, e-mail:
>>>>>> [email protected]
>>>>>> > >> >>> > > >> > For additional commands, e-mail:
>>>>>> [email protected]
>>>>>> > >> >>> > > >> >
>>>>>> > >> >>> > > >>
>>>>>> > >> >>> > > >>
>>>>>> > >> >>> > > >>
>>>>>> ---------------------------------------------------------------------
>>>>>> > >> >>> > > >> To unsubscribe, e-mail:
>>>>>> [email protected]
>>>>>> > >> >>> > > >> For additional commands, e-mail:
>>>>>> [email protected]
>>>>>> > >> >>> > > >>
>>>>>> > >> >>>
>>>>>> > >> >>>
>>>>>> ---------------------------------------------------------------------
>>>>>> > >> >>> To unsubscribe, e-mail: [email protected]
>>>>>> > >> >>> For additional commands, e-mail: [email protected]
>>>>>> > >> >>>
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> > --
>>>>>> > >> > Adrien
>>>>>> > >>
>>>>>> > >>
>>>>>> ---------------------------------------------------------------------
>>>>>> > >> To unsubscribe, e-mail: [email protected]
>>>>>> > >> For additional commands, e-mail: [email protected]
>>>>>> > >>
>>>>>> > >
>>>>>> > >
>>>>>> > > --
>>>>>> > > Adrien
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>>
>>>>>>

-- 
Adrien

Re: Subject: New branch and feature freeze for Lucene 9.4.0

Reply via email to