kaivalnp commented on issue #14758: URL: https://github.com/apache/lucene/issues/14758#issuecomment-3313836825
Thank you for your inputs @mikemccand @benwtrent :)

> Let's get your luceneutil changes merged -- this is useful for benchmarking

I had it in a shared branch earlier, opened https://github.com/mikemccand/luceneutil/pull/468 now!

> If we dedup only within one doc, it is (maybe?) simpler than dedup across docs?

It is (probably) simpler API-wise and perhaps less work at indexing, but results in larger indexes

> or maybe they are not realizing they have dups

Dups inside Lucene could occur in one of three ways:
1. The inputs to the model generating vectors are the same across two documents (even though other fields are different)
2. The model generates the same vector for two different inputs (sort of like a hash collision)
3. The user knowingly passes the same vector to a new field for "index-time filtering"

I guess users can figure out (1) on their own, (2) is much less likely, and I'm hoping users find some value in (3) with the above benchmark change :)

> vectors across all fields being represented in one big chunk

FWIW we'd still create multiple chunks, with each chunk containing "homogeneous" vectors (i.e. ones where duping is even possible).

In this whole idea, locality benefits can be improved if the maximum number of "homogeneous" groups is created, with each group having the minimum number of fields that will _actually_ share vectors. I don't know how to _best_ determine these groups -- the obvious parameters to include are: encoding (`byte` / `float32`) and dimensions.

Perhaps we ask the user directly, but this would create some kind of dependency between fields that does not exist today. To avoid this complication -- I was proposing that Lucene "automatically" de-dup based on the above parameters -- but suggestions welcome here.

At the end, we just want separate HNSW graphs created for filters known at index-time :)
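
To make the "automatic" grouping a bit more concrete, here is a rough Python sketch of the idea (not Lucene code). It assumes that groups are keyed only by encoding and dimension, and that de-duping is by exact vector equality; `DedupVectorStore`, `add`, and the field names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


class DedupVectorStore:
    """Hypothetical store: one chunk of unique vectors per (encoding, dimension)."""

    def __init__(self):
        # (encoding, dimension) -> list of unique vectors in that chunk
        self.chunks = defaultdict(list)
        # (encoding, dimension) -> {vector: ordinal within the chunk}
        self._ordinals = defaultdict(dict)
        # field name -> list of (doc_id, ordinal) pairs pointing into its chunk
        self.fields = defaultdict(list)
        # field name -> the (encoding, dimension) group it belongs to
        self.field_group = {}

    def add(self, doc_id, field, encoding, vector):
        key = (encoding, len(vector))
        # A field must keep the same encoding and dimension for all docs.
        if self.field_group.setdefault(field, key) != key:
            raise ValueError(f"field {field!r} changed encoding or dimension")
        vec = tuple(vector)
        seen = self._ordinals[key]
        ordinal = seen.get(vec)
        if ordinal is None:
            # First time this exact vector appears in the group: store it once.
            ordinal = len(self.chunks[key])
            self.chunks[key].append(vec)
            seen[vec] = ordinal
        # The field only records a reference to the shared vector.
        self.fields[field].append((doc_id, ordinal))
        return ordinal


store = DedupVectorStore()
store.add(0, "all_docs", "float32", [0.1, 0.2, 0.3])
# Case (3): the same vector is passed to a second field as an index-time filter.
store.add(0, "only_shoes", "float32", [0.1, 0.2, 0.3])
store.add(1, "all_docs", "float32", [0.4, 0.5, 0.6])
# Same dimension but a different encoding, so it lands in a separate chunk.
store.add(1, "thumbnail", "byte", [1, 2, 3])
```

Here `all_docs` and `only_shoes` share one stored copy of the first vector, while each field would still get its own HNSW graph built over its own list of ordinals. The sketch ignores the locality question above: it always creates the coarsest groups, which is the "automatic" option rather than a user-declared one.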
