mayya-sharipova commented on issue #11507:
URL: https://github.com/apache/lucene/issues/11507#issuecomment-1629661053
@rmuir
> Can we run this test with lucene's defaults (e.g. not a 2GB rambuffer)?
I ran the test and, surprisingly, indexing time decreased substantially.
Indexing with Lucene's defaults is roughly 1.7 times faster than with a 2 GB
RAM buffer (1877 s vs. 3141 s), at the expense of ending up with more segments.
- Lucene 9.7 branch with FloatVectorValues.MAX_DIMENSIONS set to 2048
- preferredBitSize=128
- Panama Vector API enabled
- vector dims: 1536
- num of docs: 2.68M
| RAM buffer size | Indexing time | Number of segments |
|----------------:|--------------:|-------------------:|
| 16 MB | 1877 s | 19 |
| 1994 MB | 3141 s | 9 |
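
To make the comparison concrete, here is a minimal sketch that derives the throughput and the speedup ratio from the table. The class and method names are hypothetical, and it assumes both runs indexed the same 2680961 documents reported in the log for the 16 MB run.

```java
// Figures are copied from the table; the document count comes from the log line
// "Indexed 2680961 documents in 1877s" and is ASSUMED to be identical for both runs.
final class IndexingSpeedup {
    static final long NUM_DOCS = 2_680_961L;

    /** Documents indexed per second of wall-clock time. */
    static double docsPerSecond(long docs, long seconds) {
        return (double) docs / seconds;
    }

    /** How many times faster the fast run is compared with the slow run. */
    static double speedup(long slowSeconds, long fastSeconds) {
        return (double) slowSeconds / fastSeconds;
    }

    public static void main(String[] args) {
        long defaultBufferSeconds = 1877; // 16 MB RAM buffer (Lucene default)
        long largeBufferSeconds = 3141;   // 1994 MB RAM buffer

        System.out.printf("16 MB buffer:   %.1f docs/s%n",
                docsPerSecond(NUM_DOCS, defaultBufferSeconds));
        System.out.printf("1994 MB buffer: %.1f docs/s%n",
                docsPerSecond(NUM_DOCS, largeBufferSeconds));
        System.out.printf("speedup:        %.2fx%n",
                speedup(largeBufferSeconds, defaultBufferSeconds));
    }
}
```

The ratio works out to about 1.67, which is a large gain but somewhat short of a full 2x.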
<details>
<summary>Details</summary>
```
WARNING: Using incubator modules: jdk.incubator.vector
Jul 10, 2023 3:35:25 P.M.
org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 20; to disable start with
-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
Jul 10, 2023 3:35:26 P.M. org.apache.lucene.util.VectorUtilPanamaProvider
<init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128
_fc.fdt _v6.fnm
_vj.si _vr_Lucene95HnswVectorsFormat_0.vec
_fc.fdx _v6.si
_vj_Lucene95HnswVectorsFormat_0.vec _vr_Lucene95HnswVectorsFormat_0.vem
_fc.fnm _v6_Lucene95HnswVectorsFormat_0.vec
_vj_Lucene95HnswVectorsFormat_0.vem _vr_Lucene95HnswVectorsFormat_0.vex
_fc.si _v6_Lucene95HnswVectorsFormat_0.vem
_vj_Lucene95HnswVectorsFormat_0.vex _vs.fdm
_fc_Lucene95HnswVectorsFormat_0.vec _v6_Lucene95HnswVectorsFormat_0.vex
_vl.fdm _vs.fdt
creating index in vectors.bin-16-100.index
MS 0 [2023-07-10T14:47:25.668178Z; main]: initDynamicDefaults
maxThreadCount=4 maxMergeCount=9
IFD 0 [2023-07-10T14:47:25.725823Z; main]: init: current segments file is
"segments";
deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@64f6106c
IFD 0 [2023-07-10T14:47:25.735809Z; main]: now delete 0 files: []
IFD 0 [2023-07-10T14:47:25.738456Z; main]: now checkpoint "" [0 segments ;
isCommit = false]
IFD 0 [2023-07-10T14:47:25.738587Z; main]: now delete 0 files: []
IFD 0 [2023-07-10T14:47:25.743719Z; main]: 2 ms to checkpoint
IW 0 [2023-07-10T14:47:25.744195Z; main]: init: create=true reader=null
IW 0 [2023-07-10T14:47:25.779752Z; main]:
dir=MMapDirectory@/Users/mayya/Elastic/knn/open_ai_vectors/vectors.bin-16-100.index
lockFactory=org.apache.lucene.store.NativeFSLockFactory@319b92f3
index=
version=9.7.0
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
ramBufferSizeMB=16.0
maxBufferedDocs=-1
mergedSegmentWarmer=null
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE
similarity=org.apache.lucene.search.similarities.BM25Similarity
mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=4, maxMergeCount=9,
ioThrottle=true
codec=Lucene95
infoStream=org.apache.lucene.util.PrintStreamInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10,
maxMergedSegmentMB=5120.0, floorSegmentMB=2.0,
forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0,
maxCFSSegmentSizeMB=8.796093022208E12, noCFSRatio=0.1, deletesPctAllowed=20.0
readerPooling=true
perThreadHardLimitMB=1945
useCompoundFile=false
commitOnClose=true
indexSort=null
checkPendingFlushOnUpdate=true
softDeletesField=null
maxFullFlushMergeWaitMillis=500
leafSorter=null
eventListener=org.apache.lucene.index.IndexWriterEventListener$1@10a035a0
writer=org.apache.lucene.index.IndexWriter@67b467e9
IW 0 [2023-07-10T14:47:25.780320Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
FP 0 [2023-07-10T14:47:27.042597Z; main]: trigger flush:
activeBytes=16779458 deleteBytes=0 vs ramBufferMB=16.0
FP 0 [2023-07-10T14:47:27.045564Z; main]: thread state has 16779458 bytes;
docInRAM=2589
FP 0 [2023-07-10T14:47:27.049109Z; main]: 1 in-use non-flushing threads
states
DWPT 0 [2023-07-10T14:47:27.050859Z; main]: flush postings as segment _0
numDocs=2589
....
Indexed 2680961 documents in 1877s
```
</details>
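
As a rough sanity check on the first flush in the log (`activeBytes=16779458`, `docInRAM=2589`), the sketch below estimates the buffered bytes per document and how many flushes a 16 MB buffer would trigger over the whole run. The class and method names are hypothetical. It assumes 4-byte float components, a single indexing thread, and that every flush holds about as many documents as the first one; none of this is confirmed by the log beyond the first flush.

```java
// Back-of-the-envelope estimate based on the first "trigger flush" entry in the log.
// ASSUMPTIONS: float32 vectors, one indexing thread, and all flushes the same size
// as the first one (2589 docs).
final class FlushEstimate {

    /** Raw size of one float vector with the given number of dimensions. */
    static long rawVectorBytes(int dims) {
        return (long) dims * Float.BYTES;
    }

    /** Average RAM buffer usage per document at flush time. */
    static double bytesPerDoc(long activeBytes, int docsInRam) {
        return (double) activeBytes / docsInRam;
    }

    /** Number of flushes needed for totalDocs, rounding up for the final partial flush. */
    static long estimatedFlushes(long totalDocs, int docsPerFlush) {
        return (totalDocs + docsPerFlush - 1) / docsPerFlush;
    }

    public static void main(String[] args) {
        int dims = 1536;
        long activeBytes = 16_779_458L; // from "trigger flush: activeBytes=16779458"
        int docsInRam = 2589;           // from "docInRAM=2589"
        long totalDocs = 2_680_961L;    // from "Indexed 2680961 documents in 1877s"

        System.out.println("raw vector bytes per doc: " + rawVectorBytes(dims));
        System.out.printf("buffered bytes per doc:   %.1f%n", bytesPerDoc(activeBytes, docsInRam));
        System.out.println("estimated flushes:        " + estimatedFlushes(totalDocs, docsInRam));
    }
}
```

Under these assumptions each document takes about 6481 bytes in the buffer, of which 6144 bytes are the raw vector, and the run would flush on the order of a thousand small segments that merging then reduces to the 19 reported above.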