Thanks for the deep dive, Julie. I was able to reproduce the changing recall. I had introduced some bugs in the diversity checks (they may have partially canceled each other out? It's hard to understand what was happening in the buggy case) and posted a fix today: https://github.com/apache/lucene/pull/11781. There's a simplified sketch of the diversity heuristic below for anyone not familiar with it.

There are a couple of other outstanding issues I found while doing a bunch of git bisecting: I think we may have introduced a (test-only) performance regression in KnnGraphTester, and we may still be over-allocating the size of NeighborArray, leading to excessive segmentation? I wonder if we could avoid dynamic re-allocation there and simply initialize every neighbor array to 2*M+1 (second sketch below). While I don't think these are necessarily blockers, given that we are releasing HNSW improvements it seems like we should address them, especially since build-graph-on-index is one of the things we are releasing and it is (may be?) impacted. I will see if I can put up a patch or two. It would be great if you all are able to test again with https://github.com/apache/lucene/pull/11781/ applied
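To make the first point concrete, here is a rough, self-contained sketch of the kind of diversity heuristic HNSW applies when selecting a node's neighbors. It is an illustration with invented names (DiversitySketch, selectDiverse), not the code the PR touches: a candidate is kept only if it is more similar to the node being linked than to any neighbor already kept, which spreads the links out instead of clustering them. Subtle bugs in this check change which edges survive, and with them, recall.

import java.util.ArrayList;
import java.util.List;

class DiversitySketch {

  // Dot-product similarity; higher means more similar.
  static float similarity(float[] a, float[] b) {
    float dot = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
    }
    return dot;
  }

  // candidates must be sorted by decreasing similarity to node.
  static List<float[]> selectDiverse(float[] node, List<float[]> candidates, int maxConn) {
    List<float[]> selected = new ArrayList<>();
    for (float[] cand : candidates) {
      if (selected.size() >= maxConn) {
        break;
      }
      float scoreToNode = similarity(cand, node);
      boolean diverse = true;
      for (float[] prev : selected) {
        // A candidate that is closer to an already-selected neighbor than
        // to the node itself adds little new routing information; skip it.
        if (similarity(cand, prev) > scoreToNode) {
          diverse = false;
          break;
        }
      }
      if (diverse) {
        selected.add(cand);
      }
    }
    return selected;
  }
}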
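And for the second point, the idea would be something like the following (an invented class, not the real NeighborArray): on the bottom HNSW layer a node can hold up to 2*M neighbors, so sizing the arrays once at 2*M+1 (the +1 presumably leaving room for one transiently over-full entry before pruning) means they never need to grow, and the RAM accounting stays predictable.

class FixedNeighborsSketch {
  final int[] nodes;
  final float[] scores;
  int size;

  FixedNeighborsSketch(int m) {
    // Worst case is 2*M neighbors on level 0, plus one slot for a
    // candidate added just before the list is pruned; the arrays are
    // never re-allocated after this.
    int capacity = 2 * m + 1;
    nodes = new int[capacity];
    scores = new float[capacity];
  }

  void add(int node, float score) {
    nodes[size] = node;
    scores[size] = score;
    size++;
  }
}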
-Mike

On Fri, Sep 16, 2022 at 11:07 AM Adrien Grand <jpou...@gmail.com> wrote:
>
> Thank you Mike, I just backported the change.
>
> On Thu, Sep 15, 2022 at 6:32 PM Michael Sokolov <msoko...@gmail.com> wrote:
>>
>> It looks like a small bug fix; we have had it on main (and 9.x?) for a
>> while now and no test failures showed up, I guess. Should be OK to
>> port. I plan to cut artifacts this weekend, or Monday at the latest,
>> but if you can do the backport today or tomorrow, that's fine by me.
>>
>> On Thu, Sep 15, 2022 at 10:55 AM Adrien Grand <jpou...@gmail.com> wrote:
>> >
>> > Mike, I'm tempted to backport https://github.com/apache/lucene/pull/1068
>> > to branch_9_4, which is a bug fix that looks pretty safe to me. What do
>> > you think?
>> >
>> > On Tue, Sep 13, 2022 at 4:11 PM Mayya Sharipova
>> > <mayya.sharip...@elastic.co.invalid> wrote:
>> >>
>> >> Thanks for running more tests, Michael. It is encouraging that you saw
>> >> similar performance between 9.3 and 9.4. I will also run more tests
>> >> with different parameters.
>> >>
>> >> On Tue, Sep 13, 2022 at 9:30 AM Michael Sokolov <msoko...@gmail.com> wrote:
>> >>>
>> >>> As a follow-up, I ran a test using the same parameters as above, only
>> >>> changing M=200 to M=16. This did result in a single segment in both
>> >>> cases (9.3, 9.4), and the performance was pretty similar; within
>> >>> noise, I think. The main difference I saw was that the 9.3 index was
>> >>> written using CFS:
>> >>>
>> >>> 9.4:
>> >>> recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
>> >>> 0.755   1.36     1000000  100     16       100        200      891402   1.00  post-filter
>> >>> -rw-r--r-- 1 sokolovm amazon 382M Sep 13 13:06 _0_Lucene94HnswVectorsFormat_0.vec
>> >>> -rw-r--r-- 1 sokolovm amazon 262K Sep 13 13:06 _0_Lucene94HnswVectorsFormat_0.vem
>> >>> -rw-r--r-- 1 sokolovm amazon 131M Sep 13 13:06 _0_Lucene94HnswVectorsFormat_0.vex
>> >>>
>> >>> 9.3:
>> >>> recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
>> >>> 0.775   1.34     1000000  100     16       100        4033     977043
>> >>> -rw-r--r-- 1 sokolovm amazon 297  Sep 13 13:26 _0.cfe
>> >>> -rw-r--r-- 1 sokolovm amazon 516M Sep 13 13:26 _0.cfs
>> >>> -rw-r--r-- 1 sokolovm amazon 340  Sep 13 13:26 _0.si
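(A note on that CFS difference: whether a freshly flushed segment gets packed into a compound file is an IndexWriter setting, and pinning it makes the 9.3/9.4 file listings directly comparable. A minimal sketch against the Lucene 9.x IndexWriterConfig API; this is not asserting that this setting is why 9.3 ended up in CFS here.)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

class CompoundFileSketch {
  static IndexWriterConfig noCfsConfig() {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Write per-format files (.vec/.vem/.vex, etc.) instead of packing
    // newly flushed segments into a single .cfs compound file.
    iwc.setUseCompoundFile(false);
    return iwc;
  }
}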
>> >>> On Tue, Sep 13, 2022 at 8:50 AM Michael Sokolov <msoko...@gmail.com> wrote:
>> >>> >
>> >>> > I ran another test. I thought I had increased the RAM buffer size to
>> >>> > 8G and heap to 16G; however, I still see two segments in the index
>> >>> > that was created. And looking at the infostream I see:
>> >>> >
>> >>> > dir=MMapDirectory@/local/home/sokolovm/workspace/knn-perf/glove-100-angular.hdf5-train-200-200.index lockFactory=org.apache.lucene.store.NativeFSLockFactory@4466af20
>> >>> > index=
>> >>> > version=9.4.0
>> >>> > analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>> >>> > ramBufferSizeMB=8000.0
>> >>> > maxBufferedDocs=-1
>> >>> > ...
>> >>> > perThreadHardLimitMB=1945
>> >>> > ...
>> >>> > DWPT 0 [2022-09-13T02:42:53.329404950Z; main]: flush postings as segment _6 numDocs=555373
>> >>> > IW 0 [2022-09-13T02:42:53.330671171Z; main]: 0 msec to write norms
>> >>> > IW 0 [2022-09-13T02:42:53.331113184Z; main]: 0 msec to write docValues
>> >>> > IW 0 [2022-09-13T02:42:53.331320146Z; main]: 0 msec to write points
>> >>> > IW 0 [2022-09-13T02:42:56.424195657Z; main]: 3092 msec to write vectors
>> >>> > IW 0 [2022-09-13T02:42:56.429239944Z; main]: 4 msec to finish stored fields
>> >>> > IW 0 [2022-09-13T02:42:56.429593512Z; main]: 0 msec to write postings and finish vectors
>> >>> > IW 0 [2022-09-13T02:42:56.430309031Z; main]: 0 msec to write fieldInfos
>> >>> > DWPT 0 [2022-09-13T02:42:56.431721622Z; main]: new segment has 0 deleted docs
>> >>> > DWPT 0 [2022-09-13T02:42:56.431921144Z; main]: new segment has 0 soft-deleted docs
>> >>> > DWPT 0 [2022-09-13T02:42:56.435738086Z; main]: new segment has no vectors; no norms; no docValues; no prox; freqs
>> >>> > DWPT 0 [2022-09-13T02:42:56.435952356Z; main]: flushedFiles=[_6_Lucene94HnswVectorsFormat_0.vec, _6.fdm, _6.fdt, _6_Lucene94HnswVectorsFormat_0.vem, _6.fnm, _6.fdx, _6_Lucene94HnswVectorsFormat_0.vex]
>> >>> > DWPT 0 [2022-09-13T02:42:56.436121861Z; main]: flushed codec=Lucene94
>> >>> > DWPT 0 [2022-09-13T02:42:56.437691468Z; main]: flushed: segment=_6 ramUsed=1,945.002 MB newFlushedSize=1,065.701 MB docs/MB=521.134
>> >>> >
>> >>> > So I think it's this perThreadHardLimit that is triggering the
>> >>> > flushes? TBH this isn't something I had seen before, but the docs say:
>> >>> >
>> >>> > /**
>> >>> >  * Expert: Sets the maximum memory consumption per thread triggering a forced flush if exceeded. A
>> >>> >  * {@link DocumentsWriterPerThread} is forcefully flushed once it exceeds this limit even if the
>> >>> >  * {@link #getRAMBufferSizeMB()} has not been exceeded. This is a safety limit to prevent a {@link
>> >>> >  * DocumentsWriterPerThread} from address space exhaustion due to its internal 32 bit signed
>> >>> >  * integer based memory addressing. The given value must be less than 2GB (2048MB).
>> >>> >  *
>> >>> >  * @see #DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB
>> >>> >  */
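(To make the flush mechanics above concrete: even with ramBufferSizeMB=8000, each DocumentsWriterPerThread hard-flushes at the per-thread limit. A minimal sketch against the Lucene 9.x IndexWriterConfig API, using the values from the infostream; FlushSettingsSketch is just an invented wrapper.)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

class FlushSettingsSketch {
  static IndexWriterConfig config() {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Global flush trigger, matching ramBufferSizeMB=8000.0 above (8 GB).
    iwc.setRAMBufferSizeMB(8000.0);
    // Per-thread safety valve, matching perThreadHardLimitMB=1945: each
    // DocumentsWriterPerThread is force-flushed at this size regardless of
    // the global buffer, because DWPT memory addressing is 32-bit signed;
    // the value must stay below 2048 MB. At ~1 GB flushed per ~555k docs,
    // as in the log above, a 1M-doc run ends up with at least two flushed
    // segments unless they are merged afterwards.
    iwc.setRAMPerThreadHardLimitMB(1945);
    return iwc;
  }
}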
>> >>> > On Mon, Sep 12, 2022 at 6:28 PM Michael Sokolov <msoko...@gmail.com> wrote:
>> >>> > >
>> >>> > > Hi Mayya, thanks for persisting - I think we need to wrestle this
>> >>> > > to the ground for sure. In the test I ran, the RAM buffer was the
>> >>> > > default checked in, which is weirdly 1994MB. I did not specifically
>> >>> > > set the heap size. I used maxConn/M=200. I'll try with a larger
>> >>> > > buffer to see if I can get 9.4 to produce a single segment for the
>> >>> > > same test settings. I see you used a much smaller M (16), which
>> >>> > > should have produced quite small graphs and, I agree, should have
>> >>> > > been a single segment. Were you able to verify the number of
>> >>> > > segments?
>> >>> > >
>> >>> > > Agree that a decrease in recall is not expected when more segments
>> >>> > > are produced.
>> >>> > >
>> >>> > > On Mon, Sep 12, 2022 at 1:51 PM Mayya Sharipova
>> >>> > > <mayya.sharip...@elastic.co.invalid> wrote:
>> >>> > > >
>> >>> > > > Hello Michael,
>> >>> > > > Thanks for checking, and sorry for bringing this up again. First
>> >>> > > > of all, I am ok with proceeding with the Lucene 9.4 release and
>> >>> > > > leaving the performance investigations for later.
>> >>> > > >
>> >>> > > > I am interested in what maxConn/M value you used for your tests.
>> >>> > > > What were the heap memory and the size of the RAM buffer for
>> >>> > > > indexing? Usually, when we have multiple segments, recall should
>> >>> > > > increase, not decrease. But I agree that with multiple segments
>> >>> > > > we can see a big drop in QPS.
>> >>> > > >
>> >>> > > > Here is my investigation with detailed output of the performance
>> >>> > > > difference between the 9.3 and 9.4 releases. In my tests I used a
>> >>> > > > large indexing buffer (2GB) and a large heap (5GB) to end up with
>> >>> > > > a single segment for both the 9.3 and 9.4 tests, but still saw a
>> >>> > > > big drop in QPS in 9.4.
>> >>> > > >
>> >>> > > > Thank you.
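(To unpack the point about segments: a Lucene kNN query searches each segment's HNSW graph separately and merges the per-segment top-k lists, so more segments mean more graphs traversed per query, which costs QPS, while the extra candidates examined generally push recall up. A toy sketch of that merge step, with invented names, not Lucene's actual code:)

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class SegmentMergeSketch {

  static class Hit {
    final int docId;
    final float score;

    Hit(int docId, float score) {
      this.docId = docId;
      this.score = score;
    }
  }

  // perSegmentHits holds one top-k result list per segment; searching
  // N segments costs roughly N graph traversals but considers more
  // total candidates than a single merged graph would.
  static List<Hit> mergeTopK(List<List<Hit>> perSegmentHits, int k) {
    // Min-heap on score: the head is the worst of the best-so-far.
    PriorityQueue<Hit> best = new PriorityQueue<>(Comparator.comparingDouble(h -> h.score));
    for (List<Hit> segmentHits : perSegmentHits) {
      for (Hit hit : segmentHits) {
        best.offer(hit);
        if (best.size() > k) {
          best.poll(); // evict the current worst
        }
      }
    }
    List<Hit> merged = new ArrayList<>(best);
    merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
    return merged;
  }
}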
>> >>> > > > On Fri, Sep 9, 2022 at 12:21 PM Alan Woodward <romseyg...@gmail.com> wrote:
>> >>> > > >>
>> >>> > > >> Done. Thanks!
>> >>> > > >>
>> >>> > > >> > On 9 Sep 2022, at 16:32, Michael Sokolov <msoko...@gmail.com> wrote:
>> >>> > > >> >
>> >>> > > >> > Hi Alan - I checked out the interval queries patch; it seems
>> >>> > > >> > pretty safe, so please go ahead and port it to 9.4. Thanks!
>> >>> > > >> >
>> >>> > > >> > Mike
>> >>> > > >> >
>> >>> > > >> > On Fri, Sep 9, 2022 at 10:41 AM Alan Woodward <romseyg...@gmail.com> wrote:
>> >>> > > >> >>
>> >>> > > >> >> Hi Mike,
>> >>> > > >> >>
>> >>> > > >> >> I've opened https://github.com/apache/lucene/pull/11760 as a
>> >>> > > >> >> small bug-fix PR for a problem with interval queries. Am I OK
>> >>> > > >> >> to port this to the 9.4 branch?
>> >>> > > >> >>
>> >>> > > >> >> Thanks, Alan
>> >>> > > >> >>
>> >>> > > >> >> On 2 Sep 2022, at 20:42, Michael Sokolov <msoko...@gmail.com> wrote:
>> >>> > > >> >>
>> >>> > > >> >> NOTICE:
>> >>> > > >> >>
>> >>> > > >> >> Branch branch_9_4 has been cut and versions updated to 9.5 on
>> >>> > > >> >> the stable branch.
>> >>> > > >> >>
>> >>> > > >> >> Please observe the normal rules:
>> >>> > > >> >>
>> >>> > > >> >> * No new features may be committed to the branch.
>> >>> > > >> >> * Documentation patches, build patches and serious bug fixes
>> >>> > > >> >> may be committed to the branch. However, you should submit all
>> >>> > > >> >> patches you want to commit to Jira first to give others the
>> >>> > > >> >> chance to review and possibly vote against the patch. Keep in
>> >>> > > >> >> mind that it is our main intention to keep the branch as
>> >>> > > >> >> stable as possible.
>> >>> > > >> >> * All patches that are intended for the branch should first be
>> >>> > > >> >> committed to the unstable branch, merged into the stable
>> >>> > > >> >> branch, and then into the current release branch.
>> >>> > > >> >> * Normal unstable and stable branch development may continue
>> >>> > > >> >> as usual. However, if you plan to commit a big change to the
>> >>> > > >> >> unstable branch while the branch feature freeze is in effect,
>> >>> > > >> >> think twice: can't the addition wait a couple more days?
>> >>> > > >> >> Merges of bug fixes into the branch may become more difficult.
>> >>> > > >> >> * Only Jira issues with Fix version 9.4 and priority "Blocker"
>> >>> > > >> >> will delay a release candidate build.
>>
>> > --
>> > Adrien
>
> --
> Adrien