OK, I think I was wrong about latency having increased due to a change in KnnGraphTester -- I did some testing there and couldn't reproduce. There does seem to be a slight vector search latency increase, possibly noise, but maybe due to the branching introduced to check whether to do byte vs. float operations? It would be a little surprising if that were the case, though, given the small number of branches compared to the number of multiplies in the dot-product.
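To make that concern concrete, here is a rough sketch of the two shapes the check could take. This is hypothetical code -- made-up names, not the actual VectorUtil implementation -- just to show why one shape could plausibly matter and the other shouldn't:

class DotProductSketch {
  // Worst case: the byte-vs-float check sits inside the loop and is
  // evaluated once per dimension, per comparison.
  static float branchy(float[] fa, float[] fb, byte[] ba, byte[] bb, boolean bytes) {
    int dims = bytes ? ba.length : fa.length;
    float sum = 0;
    for (int i = 0; i < dims; i++) {
      sum += bytes ? (ba[i] * bb[i]) : (fa[i] * fb[i]);
    }
    return sum;
  }

  // Hoisted: one check per vector comparison; the inner loops stay
  // branch-free and as vectorizable as they were before byte vectors existed.
  static float hoisted(float[] fa, float[] fb, byte[] ba, byte[] bb, boolean bytes) {
    return bytes ? dot(ba, bb) : dot(fa, fb);
  }

  static float dot(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  static float dot(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }
}

Even in the branchy shape, a well-predicted branch is cheap next to the ~dims multiply-adds per comparison, which is why a measurable regression here would be surprising.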
On Sun, Sep 18, 2022 at 3:25 PM Michael Sokolov <msoko...@gmail.com> wrote:
>
> Thanks for the deep dive, Julie. I was able to reproduce the changing
> recall. I had introduced some bugs in the diversity checks (which may
> have partially canceled each other out? it's hard to understand what
> was happening in the buggy case) and posted a fix today:
> https://github.com/apache/lucene/pull/11781
>
> There are a couple of other outstanding issues I found while doing a
> bunch of git bisecting:
>
> I think we might have introduced a (test-only) performance regression
> in KnnGraphTester.
>
> We may still be over-allocating the size of NeighborArray, leading to
> excessive segmentation? I wonder if we could avoid dynamic
> re-allocation there and simply initialize every neighbor array to
> 2*M+1.
>
> While I don't think these are necessarily blockers, given that we are
> releasing HNSW improvements it seems like we should address them,
> especially as build-graph-on-index is one of the things we are
> releasing, and it is (or may be) impacted. I will see if I can put up
> a patch or two.
>
> It would be great if you all are able to test again with
> https://github.com/apache/lucene/pull/11781/ applied.
>
> -Mike
>
> On Fri, Sep 16, 2022 at 11:07 AM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > Thank you Mike, I just backported the change.
> >
> > On Thu, Sep 15, 2022 at 6:32 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >>
> >> It looks like a small bug fix that we have had on main (and 9.x?)
> >> for a while now, and no test failures showed up, I guess. Should be
> >> OK to port. I plan to cut artifacts this weekend, or Monday at the
> >> latest, but if you can do the backport today or tomorrow, that's
> >> fine by me.
> >>
> >> On Thu, Sep 15, 2022 at 10:55 AM Adrien Grand <jpou...@gmail.com> wrote:
> >> >
> >> > Mike, I'm tempted to backport https://github.com/apache/lucene/pull/1068
> >> > to branch_9_4, which is a bugfix that looks pretty safe to me.
> >> > What do you think?
> >> >
> >> > On Tue, Sep 13, 2022 at 4:11 PM Mayya Sharipova
> >> > <mayya.sharip...@elastic.co.invalid> wrote:
> >> >>
> >> >> Thanks for running more tests, Michael.
> >> >> It is encouraging that you saw similar performance between 9.3
> >> >> and 9.4. I will also run more tests with different parameters.
> >> >>
> >> >> On Tue, Sep 13, 2022 at 9:30 AM Michael Sokolov <msoko...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> As a follow-up, I ran a test using the same parameters as above,
> >> >>> only changing M=200 to M=16. This did result in a single segment
> >> >>> in both cases (9.3, 9.4), and the performance was pretty similar;
> >> >>> within noise, I think.
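Re: the NeighborArray point in my note at the top -- a minimal sketch of what fixed allocation could look like, assuming (as in the usual HNSW construction) at most 2*M neighbors per node on the densest layer, plus one slot to hold an incoming candidate while the diversity check decides what to prune. Illustrative only, not the actual org.apache.lucene.util.hnsw.NeighborArray:

class FixedNeighborArray {
  final int[] nodes;
  final float[] scores;
  int size;

  FixedNeighborArray(int m) {
    // 2*M max neighbors on the densest layer, +1 for the incoming
    // candidate; allocated once, never grown.
    int capacity = 2 * m + 1;
    nodes = new int[capacity];
    scores = new float[capacity];
  }

  // Append a candidate; the diversity check is expected to prune back to
  // at most 2*M entries before the next add, so capacity is never exceeded.
  void add(int node, float score) {
    nodes[size] = node;
    scores[size] = score;
    size++;
  }
}

The trade-off is that every node pays the worst-case footprint up front, but the allocation count becomes deterministic and RAM accounting gets simpler.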
> >> >>> The main difference I saw was that the 9.3 index was written
> >> >>> using CFS:
> >> >>>
> >> >>> 9.4:
> >> >>> recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
> >> >>> 0.755   1.36     1000000  100     16       100        200      891402    1.00  post-filter
> >> >>>
> >> >>> -rw-r--r-- 1 sokolovm amazon 382M Sep 13 13:06 _0_Lucene94HnswVectorsFormat_0.vec
> >> >>> -rw-r--r-- 1 sokolovm amazon 262K Sep 13 13:06 _0_Lucene94HnswVectorsFormat_0.vem
> >> >>> -rw-r--r-- 1 sokolovm amazon 131M Sep 13 13:06 _0_Lucene94HnswVectorsFormat_0.vex
> >> >>>
> >> >>> 9.3:
> >> >>> recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
> >> >>> 0.775   1.34     1000000  100     16       100        4033     977043
> >> >>>
> >> >>> -rw-r--r-- 1 sokolovm amazon 297  Sep 13 13:26 _0.cfe
> >> >>> -rw-r--r-- 1 sokolovm amazon 516M Sep 13 13:26 _0.cfs
> >> >>> -rw-r--r-- 1 sokolovm amazon 340  Sep 13 13:26 _0.si
> >> >>>
> >> >>> On Tue, Sep 13, 2022 at 8:50 AM Michael Sokolov <msoko...@gmail.com>
> >> >>> wrote:
> >> >>> >
> >> >>> > I ran another test. I thought I had increased the RAM buffer
> >> >>> > size to 8G and heap to 16G. However, I still see two segments
> >> >>> > in the index that was created. And looking at the infostream I see:
> >> >>> >
> >> >>> > dir=MMapDirectory@/local/home/sokolovm/workspace/knn-perf/glove-100-angular.hdf5-train-200-200.index
> >> >>> > lockFactory=org.apache.lucene.store.NativeFSLockFactory@4466af20
> >> >>> > index=
> >> >>> > version=9.4.0
> >> >>> > analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> >> >>> > ramBufferSizeMB=8000.0
> >> >>> > maxBufferedDocs=-1
> >> >>> > ...
> >> >>> > perThreadHardLimitMB=1945
> >> >>> > ...
> >> >>> > DWPT 0 [2022-09-13T02:42:53.329404950Z; main]: flush postings as segment _6 numDocs=555373
> >> >>> > IW 0 [2022-09-13T02:42:53.330671171Z; main]: 0 msec to write norms
> >> >>> > IW 0 [2022-09-13T02:42:53.331113184Z; main]: 0 msec to write docValues
> >> >>> > IW 0 [2022-09-13T02:42:53.331320146Z; main]: 0 msec to write points
> >> >>> > IW 0 [2022-09-13T02:42:56.424195657Z; main]: 3092 msec to write vectors
> >> >>> > IW 0 [2022-09-13T02:42:56.429239944Z; main]: 4 msec to finish stored fields
> >> >>> > IW 0 [2022-09-13T02:42:56.429593512Z; main]: 0 msec to write postings and finish vectors
> >> >>> > IW 0 [2022-09-13T02:42:56.430309031Z; main]: 0 msec to write fieldInfos
> >> >>> > DWPT 0 [2022-09-13T02:42:56.431721622Z; main]: new segment has 0 deleted docs
> >> >>> > DWPT 0 [2022-09-13T02:42:56.431921144Z; main]: new segment has 0 soft-deleted docs
> >> >>> > DWPT 0 [2022-09-13T02:42:56.435738086Z; main]: new segment has no vectors; no norms; no docValues; no prox; freqs
> >> >>> > DWPT 0 [2022-09-13T02:42:56.435952356Z; main]: flushedFiles=[_6_Lucene94HnswVectorsFormat_0.vec, _6.fdm, _6.fdt, _6_Lucene94HnswVectorsFormat_0.vem, _6.fnm, _6.fdx, _6_Lucene94HnswVectorsFormat_0.vex]
> >> >>> > DWPT 0 [2022-09-13T02:42:56.436121861Z; main]: flushed codec=Lucene94
> >> >>> > DWPT 0 [2022-09-13T02:42:56.437691468Z; main]: flushed: segment=_6 ramUsed=1,945.002 MB newFlushedSize=1,065.701 MB docs/MB=521.134
> >> >>> >
> >> >>> > so I think it's this perThreadHardLimit that is triggering the flushes?
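Re: the question just above -- the log seems to confirm it: ramUsed=1,945.002 MB matches perThreadHardLimitMB=1945, so the per-DWPT hard limit, not the 8 GB RAM buffer, forced the flush. Since that limit can't be raised to 2048 MB or beyond, a benchmark that needs a single segment has to merge after indexing. A sketch, assuming a Directory is already set up (the IndexWriterConfig setters are the real API; the surrounding scaffolding is illustrative):

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

class SingleSegmentIndexSketch {
  // No matter how large the RAM buffer, each DocumentsWriterPerThread is
  // hard-capped below 2048 MB, so a large single-threaded bulk load still
  // flushes multiple segments. Merging afterwards guarantees one segment.
  static void buildSingleSegment(Directory dir) throws IOException {
    IndexWriterConfig iwc =
        new IndexWriterConfig()
            .setRAMBufferSizeMB(8000)           // soft, global flush trigger
            .setRAMPerThreadHardLimitMB(2047);  // hard, per-DWPT cap (must be < 2048)
    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
      // ... writer.addDocument(...) for the whole corpus ...
      writer.forceMerge(1); // collapse whatever got flushed into a single segment
    }
  }
}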
> >> >>> > TBH this isn't something I had seen before, but the docs say:
> >> >>> >
> >> >>> > /**
> >> >>> >  * Expert: Sets the maximum memory consumption per thread triggering
> >> >>> >  * a forced flush if exceeded. A
> >> >>> >  * {@link DocumentsWriterPerThread} is forcefully flushed once it
> >> >>> >  * exceeds this limit even if the
> >> >>> >  * {@link #getRAMBufferSizeMB()} has not been exceeded. This is a
> >> >>> >  * safety limit to prevent a {@link
> >> >>> >  * DocumentsWriterPerThread} from address space exhaustion due to
> >> >>> >  * its internal 32 bit signed
> >> >>> >  * integer based memory addressing. The given value must be less
> >> >>> >  * than 2GB (2048MB).
> >> >>> >  *
> >> >>> >  * @see #DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB
> >> >>> >  */
> >> >>> >
> >> >>> > On Mon, Sep 12, 2022 at 6:28 PM Michael Sokolov <msoko...@gmail.com>
> >> >>> > wrote:
> >> >>> > >
> >> >>> > > Hi Mayya, thanks for persisting - I think we need to wrestle this
> >> >>> > > to the ground for sure. In the test I ran, the RAM buffer was the
> >> >>> > > default checked in, which is weirdly 1994MB. I did not specifically
> >> >>> > > set heap size. I used maxConn/M=200. I'll try with a larger buffer
> >> >>> > > to see if I can get 9.4 to produce a single segment for the same
> >> >>> > > test settings. I see you used a much smaller M (16), which should
> >> >>> > > have produced quite small graphs and, I agree, should have been a
> >> >>> > > single segment. Were you able to verify the number of segments?
> >> >>> > >
> >> >>> > > Agreed that a decrease in recall is not expected when more segments
> >> >>> > > are produced.
> >> >>> > >
> >> >>> > > On Mon, Sep 12, 2022 at 1:51 PM Mayya Sharipova
> >> >>> > > <mayya.sharip...@elastic.co.invalid> wrote:
> >> >>> > > >
> >> >>> > > > Hello Michael,
> >> >>> > > > Thanks for checking.
> >> >>> > > > Sorry for bringing this up again.
> >> >>> > > > First of all, I am ok with proceeding with the Lucene 9.4 release
> >> >>> > > > and leaving the performance investigations for later.
> >> >>> > > >
> >> >>> > > > I am interested in what maxConn/M value you used for your tests.
> >> >>> > > > What were the heap memory and the size of the RAM buffer for
> >> >>> > > > indexing? Usually, when we have multiple segments, recall should
> >> >>> > > > increase, not decrease. But I agree that with multiple segments
> >> >>> > > > we can see a big drop in QPS.
> >> >>> > > >
> >> >>> > > > Here is my investigation, with detailed output, of the performance
> >> >>> > > > difference between the 9.3 and 9.4 releases. In my tests I used a
> >> >>> > > > large indexing buffer (2Gb) and a large heap (5Gb) to end up with
> >> >>> > > > a single segment for both the 9.3 and 9.4 tests, but I still see
> >> >>> > > > a big drop in QPS in 9.4.
> >> >>> > > >
> >> >>> > > > Thank you.
> >> >>> > > >
> >> >>> > > > On Fri, Sep 9, 2022 at 12:21 PM Alan Woodward
> >> >>> > > > <romseyg...@gmail.com> wrote:
> >> >>> > > >>
> >> >>> > > >> Done. Thanks!
> >> >>> > > >>
> >> >>> > > >> > On 9 Sep 2022, at 16:32, Michael Sokolov <msoko...@gmail.com>
> >> >>> > > >> > wrote:
> >> >>> > > >> >
> >> >>> > > >> > Hi Alan - I checked out the interval queries patch; seems
> >> >>> > > >> > pretty safe, please go ahead and port to 9.4. Thanks!
> >> >>> > > >> >
> >> >>> > > >> > Mike
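Re: Mayya's point above about multiple segments -- a k-NN query runs a separate graph search in every segment and merges the per-segment top-k, so latency grows roughly with segment count while recall tends to improve (more candidates get considered overall). A rough sketch of that shape; the field name and plumbing are made up, per-leaf doc id remapping is glossed over, and this is not the actual KnnVectorQuery code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.search.TopDocs;

class PerSegmentKnnSketch {
  // One full HNSW descent per segment, even when all k true neighbors
  // live in a single segment; the visited-node total (and latency) scales
  // roughly linearly with the number of segments.
  static TopDocs searchAllSegments(List<LeafReader> leaves, float[] query, int k)
      throws IOException {
    List<TopDocs> perLeaf = new ArrayList<>();
    for (LeafReader leaf : leaves) {
      perLeaf.add(leaf.searchNearestVectors("vector", query, k, null, Integer.MAX_VALUE));
    }
    // NB: a real implementation must offset per-leaf doc ids by docBase
    // before merging.
    return TopDocs.merge(k, perLeaf.toArray(TopDocs[]::new));
  }
}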
> >> >>> > > >> >
> >> >>> > > >> > On Fri, Sep 9, 2022 at 10:41 AM Alan Woodward
> >> >>> > > >> > <romseyg...@gmail.com> wrote:
> >> >>> > > >> >>
> >> >>> > > >> >> Hi Mike,
> >> >>> > > >> >>
> >> >>> > > >> >> I've opened https://github.com/apache/lucene/pull/11760 as a
> >> >>> > > >> >> small bug fix PR for a problem with interval queries. Am I OK
> >> >>> > > >> >> to port this to the 9.4 branch?
> >> >>> > > >> >>
> >> >>> > > >> >> Thanks, Alan
> >> >>> > > >> >>
> >> >>> > > >> >> On 2 Sep 2022, at 20:42, Michael Sokolov <msoko...@gmail.com> wrote:
> >> >>> > > >> >>
> >> >>> > > >> >> NOTICE:
> >> >>> > > >> >>
> >> >>> > > >> >> Branch branch_9_4 has been cut and versions updated to 9.5 on
> >> >>> > > >> >> the stable branch.
> >> >>> > > >> >>
> >> >>> > > >> >> Please observe the normal rules:
> >> >>> > > >> >>
> >> >>> > > >> >> * No new features may be committed to the branch.
> >> >>> > > >> >> * Documentation patches, build patches and serious bug fixes
> >> >>> > > >> >>   may be committed to the branch. However, you should submit
> >> >>> > > >> >>   all patches you want to commit to Jira first, to give others
> >> >>> > > >> >>   the chance to review and possibly vote against them. Keep in
> >> >>> > > >> >>   mind that it is our main intention to keep the branch as
> >> >>> > > >> >>   stable as possible.
> >> >>> > > >> >> * All patches that are intended for the branch should first be
> >> >>> > > >> >>   committed to the unstable branch, merged into the stable
> >> >>> > > >> >>   branch, and then into the current release branch.
> >> >>> > > >> >> * Normal unstable and stable branch development may continue
> >> >>> > > >> >>   as usual. However, if you plan to commit a big change to the
> >> >>> > > >> >>   unstable branch while the branch feature freeze is in effect,
> >> >>> > > >> >>   think twice: can't the addition wait a couple more days?
> >> >>> > > >> >>   Merges of bug fixes into the branch may become more difficult.
> >> >>> > > >> >> * Only Jira issues with Fix version 9.4 and priority "Blocker"
> >> >>> > > >> >>   will delay a release candidate build.