[ https://issues.apache.org/jira/browse/LUCENE-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mayya Sharipova reassigned LUCENE-10194:
----------------------------------------

    Assignee: Mayya Sharipova

> Should IndexWriter buffer KNN vectors on disk?
> ----------------------------------------------
>
>                 Key: LUCENE-10194
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10194
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Mayya Sharipova
>            Priority: Minor
>
> VectorValuesWriter buffers data in memory, like we do for all data structures
> that are computed on flush. But I wonder if this is the right trade-off.
> The use case I have in mind is someone trying to load a dataset of vectors into
> Lucene. Given that HNSW graphs are very expensive to create, we'd ideally
> load that dataset into a single segment rather than many small segments that
> then need to be merged together, which in turn re-creates the HNSW graph.
> Yet buffering vectors in memory is expensive. For instance, assuming 256
> dimensions, each vector consumes 1 kB of memory. Should we consider buffering
> vectors on disk to reduce the chances of having to create new segments only
> because the RAM buffer is full?
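
To make the trade-off above concrete, here is a minimal Java sketch. It is not Lucene's VectorValuesWriter and not a proposed patch: the class DiskVectorBuffer and all of its methods are hypothetical names introduced only for illustration. It assumes vectors are stored as 4-byte floats, which gives the 1 kB per 256-dimension vector quoted in the issue, and it ignores per-object and per-document overhead, so the in-memory estimate is a lower bound. The disk part simply appends fixed-width vectors to a temporary file and reads them back by ordinal.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration only; this is not a Lucene class. */
final class DiskVectorBuffer implements AutoCloseable {
  private final int dims;
  private final Path file;
  private final FileChannel channel;
  private final ByteBuffer scratch; // holds exactly one vector
  private int count;

  DiskVectorBuffer(int dims) throws IOException {
    this.dims = dims;
    this.file = Files.createTempFile("vectors", ".buf");
    this.channel =
        FileChannel.open(file, StandardOpenOption.READ, StandardOpenOption.WRITE);
    this.scratch = ByteBuffer.allocate(dims * Float.BYTES);
  }

  /** Raw bytes per vector, assuming 4-byte float components. */
  static long bytesPerVector(int dims) {
    return (long) dims * Float.BYTES;
  }

  /** Upper bound on vectors that fit in a RAM buffer of the given size (in MB). */
  static long vectorsPerBuffer(double ramBufferMB, int dims) {
    return (long) (ramBufferMB * 1024 * 1024) / bytesPerVector(dims);
  }

  /** Appends one vector to the temporary file instead of keeping it on the heap. */
  void add(float[] vector) throws IOException {
    if (vector.length != dims) {
      throw new IllegalArgumentException("expected " + dims + " dimensions");
    }
    scratch.clear();
    scratch.asFloatBuffer().put(vector); // fills the backing bytes; scratch position stays 0
    long pos = (long) count * scratch.capacity();
    while (scratch.hasRemaining()) {
      pos += channel.write(scratch, pos);
    }
    count++;
  }

  /** Reads the vector with the given ordinal back from disk. */
  float[] get(int ord) throws IOException {
    if (ord < 0 || ord >= count) {
      throw new IndexOutOfBoundsException("ord=" + ord);
    }
    scratch.clear();
    long pos = (long) ord * scratch.capacity();
    while (scratch.hasRemaining()) {
      int n = channel.read(scratch, pos);
      if (n < 0) {
        throw new EOFException("truncated vector file");
      }
      pos += n;
    }
    scratch.flip();
    float[] out = new float[dims];
    scratch.asFloatBuffer().get(out);
    return out;
  }

  int size() {
    return count;
  }

  @Override
  public void close() throws IOException {
    channel.close();
    Files.deleteIfExists(file);
  }
}
```

With these assumptions, bytesPerVector(256) is 1024, and vectorsPerBuffer(16.0, 256) is 16384, so a 16 MB RAM buffer (the IndexWriterConfig default) holds at most about 16k such vectors before a flush is forced. A disk-backed buffer like the one sketched here only keeps one vector on the heap at a time, at the cost of extra I/O when the vectors are read back to build the graph.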