mayya-sharipova commented on PR #992: URL: https://github.com/apache/lucene/pull/992#issuecomment-1190803247
@jpountz I ran another set of benchmarks on the dataset **sift-128-euclidean M:16 efConstruction:100 with index sort on SortField.Type.LONG**, where I added an extra index sort field: a `NumericDocValuesField` with random long values.

Observed results:
- indexing + flush as a whole is slightly faster on the candidate (548 s in candidate vs 654 s in baseline)
- baseline: indexing is fast, but flush takes 653 s
- candidate: indexing takes most of the time, and flush is very fast: about 4 s (3.9 s in the log)

Comparison with the [unsorted case](https://github.com/apache/lucene/pull/992#issuecomment-1178060346) that was run earlier:
- baseline: total indexing time (including flush) increased from 533 s to 654 s
- candidate: total indexing time (including flush) increased from 538 s to 548 s
- in particular, reconstructing the graph using the new ordinals doesn't seem to take much time: 866 ms (about 0.87 s)

**Baseline**
```bash
IW 0 [2022-07-20T21:00:49.727575Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
Done indexing 1000000 documents; now flush
IW 0 [2022-07-20T21:00:51.099538Z; main]: now flush at close
IW 0 [2022-07-20T21:00:51.100162Z; main]: start flush: applyAllDeletes=true
IW 0 [2022-07-20T21:00:51.100936Z; main]: index before flush
DW 0 [2022-07-20T21:00:51.101006Z; main]: startFullFlush
DW 0 [2022-07-20T21:00:51.107445Z; main]: anyChanges? numDocsInRam=1000000 deletes=false hasTickets:false pendingChangesInFullFlush: false
DWPT 0 [2022-07-20T21:00:51.119428Z; main]: flush postings as segment _3 numDocs=1000000
IW 0 [2022-07-20T21:00:51.715470Z; main]: 0 msec to write norms
IW 0 [2022-07-20T21:00:51.852081Z; main]: 136 msec to write docValues
IW 0 [2022-07-20T21:00:51.852305Z; main]: 0 msec to write points
HNSW 0 [2022-07-20T21:00:53.264684Z; main]: build graph from 1000000 vectors
HNSW 0 [2022-07-20T21:11:34.590292Z; main]: built 990000 in 7288/641320 ms
HNSW 0 [2022-07-20T21:11:34.590292Z; main]: built 990000 in 7288/641320 ms
IW 0 [2022-07-20T21:11:42.662461Z; main]: 650804 msec to write vectors
IW 0 [2022-07-20T21:11:43.334377Z; main]: 671 msec to finish stored fields
IW 0 [2022-07-20T21:11:43.334611Z; main]: 0 msec to write postings and finish vectors
IW 0 [2022-07-20T21:11:43.336506Z; main]: 0 msec to write fieldInfos
DWPT 0 [2022-07-20T21:11:44.244388Z; main]: flush time 653120.381917 msec
IW 0 [2022-07-20T21:11:44.247650Z; main]: publishFlushedSegment _3(10.0.0):c1000000:[indexSort=<long: "sortkey">]:...
Indexed 1000000 documents in 654s
```

**Candidate**
```bash
IW 0 [2022-07-20T18:35:41.879858Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
Done indexing 1000000 documents; now flush
IW 0 [2022-07-20T18:44:46.109074Z; main]: now flush at close
IW 0 [2022-07-20T18:44:46.109804Z; main]: start flush: applyAllDeletes=true
IW 0 [2022-07-20T18:44:46.110587Z; main]: index before flush
DW 0 [2022-07-20T18:44:46.110689Z; main]: startFullFlush
DW 0 [2022-07-20T18:44:46.115672Z; main]: anyChanges? numDocsInRam=1000000 deletes=false hasTickets:false pendingChangesInFullFlush: false
DWPT 0 [2022-07-20T18:44:46.126626Z; main]: flush postings as segment _2 numDocs=1000000
IW 0 [2022-07-20T18:44:46.741747Z; main]: 0 msec to write norms
IW 0 [2022-07-20T18:44:46.864200Z; main]: 121 msec to write docValues
IW 0 [2022-07-20T18:44:46.864364Z; main]: 0 msec to write points
IndexWriter 0 [2022-07-20T18:44:47.609637Z; main]: starting reconstructing graph ordinals 63362025298959
IndexWriter 0 [2022-07-20T18:44:48.476035Z; main]: finished reconstructing graph ordinals 63362892156709
IW 0 [2022-07-20T18:44:48.481920Z; main]: 1617 msec to write vectors
IW 0 [2022-07-20T18:44:49.166673Z; main]: 683 msec to finish stored fields
IW 0 [2022-07-20T18:44:49.167432Z; main]: 0 msec to write postings and finish vectors
IW 0 [2022-07-20T18:44:49.174701Z; main]: 6 msec to write fieldInfos
IFD 0 [2022-07-20T18:44:50.072852Z; main]: now checkpoint "_2(10.0.0):c1000000:[indexSort=<long: "sortkey">]:..
DWPT 0 [2022-07-20T18:44:50.058801Z; main]: flush time 3931.69475 msec
Indexed 1000000 documents in 548s
```
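
For anyone who wants to re-check the arithmetic behind the numbers in this comment, here is a minimal Python sketch. All figures are copied from the comment and the two log excerpts. The helper names (`pct_change`, `elapsed_seconds`) are hypothetical. Reading the trailing numbers on the "reconstructing graph ordinals" lines as nanosecond clock values is an assumption, not something the logs state.

```python
from datetime import datetime

# Totals reported by the benchmark ("Indexed 1000000 documents in ...s").
SORTED_TOTAL_S = {"baseline": 654, "candidate": 548}
# Totals from the earlier unsorted run, as quoted in the comment.
UNSORTED_TOTAL_S = {"baseline": 533, "candidate": 538}

# Log timestamps around the graph-ordinal reconstruction (candidate run).
RECONSTRUCT_START = "2022-07-20T18:44:47.609637Z"
RECONSTRUCT_END = "2022-07-20T18:44:48.476035Z"
# Trailing values on the same two log lines; ASSUMED to be nanoseconds.
RECONSTRUCT_START_NS = 63362025298959
RECONSTRUCT_END_NS = 63362892156709


def pct_change(old, new):
    """Relative change from old to new, in percent (negative means faster)."""
    return (new - old) / old * 100.0


def elapsed_seconds(start_iso, end_iso):
    """Seconds between two InfoStream-style UTC timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    delta = datetime.strptime(end_iso, fmt) - datetime.strptime(start_iso, fmt)
    return delta.total_seconds()


if __name__ == "__main__":
    # Candidate vs baseline with index sorting enabled.
    print("candidate vs baseline: %.1f%%" % pct_change(
        SORTED_TOTAL_S["baseline"], SORTED_TOTAL_S["candidate"]))
    # Cost of adding the index sort, per variant.
    for name in ("baseline", "candidate"):
        print("%s, sorted vs unsorted: %+.1f%%" % (
            name, pct_change(UNSORTED_TOTAL_S[name], SORTED_TOTAL_S[name])))
    # Reconstruction time, computed two independent ways.
    print("reconstruct (timestamps): %.3f s" % elapsed_seconds(
        RECONSTRUCT_START, RECONSTRUCT_END))
    print("reconstruct (trailing values): %.3f ms" % (
        (RECONSTRUCT_END_NS - RECONSTRUCT_START_NS) / 1e6))
```

Both ways of measuring the reconstruction step agree on roughly 866 ms. The percentage helper shows that the index sort adds roughly 23% to the baseline total but under 2% to the candidate total.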