kaivalnp commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-3313836825

   Thank you for your inputs @mikemccand @benwtrent :)
   
   > Let's get your luceneutil changes merged -- this is useful for benchmarking
   
   I had it in a shared branch earlier, and have now opened 
https://github.com/mikemccand/luceneutil/pull/468!
   
   >  If we dedup only within one doc, it is (maybe?) simpler than dedup across 
docs?
   
   It is (probably) simpler API-wise and perhaps less work at indexing time, 
but it produces larger indexes
   
   > or maybe they are not realizing they have dups
   
   Dups inside Lucene could occur in one of three ways:
   1. The input to the model generating vectors is the same across two 
documents (even though other fields are different)
   2. The model generates the same vector for two different inputs (sort of 
like a hash collision)
   3. The user knowingly passes the same vector to a new field for "index-time 
filtering"
   
   I guess users can figure out (1) on their own, (2) is much less likely, and 
I'm hoping users find some value in (3) with the above benchmark change :)
   
   > vectors across all fields being represented in one big chunk
   
   FWIW we'd still create multiple chunks, with each chunk containing 
"homogeneous" vectors (i.e. ones where duping is even possible). In this whole 
idea, locality improves if we create the maximum number of "homogeneous" 
groups, with each group having the minimum number of fields that will 
_actually_ share vectors.
   
   I don't know how to _best_ determine these groups -- the obvious parameters 
to include are: encoding (`byte` / `float32`) and dimensions. Perhaps we ask 
the user directly, but this would create some kind of dependency between fields 
that does not exist today. To avoid this complication -- I was proposing that 
Lucene "automatically" de-dup based on the above parameters -- but suggestions 
are welcome here. In the end, we just want separate HNSW graphs created for 
filters known at index-time :)
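   
   To make the "automatic" grouping concrete, here is a rough sketch of the idea 
only. It is not Lucene code and does not use any Lucene API: the class 
`VectorDedupSketch`, its methods, and the `encoding/dimension` string key are all 
hypothetical names made up for illustration. Fields land in a group based purely 
on encoding and dimension, identical vectors within a group are stored once 
(compared bit-for-bit), and each field keeps its own list of ordinals, over which 
it could build its own HNSW graph.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names exist in Lucene.
public class VectorDedupSketch {
  // Group key ("encoding/dimension") -> (raw vector bytes -> ordinal in that group).
  private final Map<String, Map<ByteBuffer, Integer>> groups = new HashMap<>();
  // Field name -> ordinals it references, in insertion order.
  private final Map<String, List<Integer>> fieldOrdinals = new HashMap<>();

  /** Adds a float32 vector for a field and returns its ordinal within the group. */
  public int add(String field, float[] vector) {
    ByteBuffer raw = ByteBuffer.allocate(vector.length * Float.BYTES);
    // The float view tracks its own position, so `raw` still spans all bytes.
    raw.asFloatBuffer().put(vector);
    return add(field, "float32/" + vector.length, raw);
  }

  /** Adds a byte vector for a field and returns its ordinal within the group. */
  public int add(String field, byte[] vector) {
    return add(field, "byte/" + vector.length, ByteBuffer.wrap(vector.clone()));
  }

  private int add(String field, String groupKey, ByteBuffer raw) {
    Map<ByteBuffer, Integer> group = groups.computeIfAbsent(groupKey, k -> new HashMap<>());
    // ByteBuffer equality is content-based, so identical vectors collide here.
    Integer ord = group.get(raw);
    if (ord == null) {
      ord = group.size();
      group.put(raw, ord);
    }
    fieldOrdinals.computeIfAbsent(field, k -> new ArrayList<>()).add(ord);
    return ord;
  }

  /** Number of distinct vectors stored in a group (0 if the group is absent). */
  public int storedVectors(String groupKey) {
    Map<ByteBuffer, Integer> group = groups.get(groupKey);
    return group == null ? 0 : group.size();
  }

  /** Ordinals referenced by a field; a per-field graph would be built over these. */
  public List<Integer> ordinalsFor(String field) {
    return fieldOrdinals.getOrDefault(field, List.of());
  }
}
```

   In this sketch, passing the same vector to a second field of the same 
encoding and dimension (case 3 above) returns the existing ordinal instead of 
storing a copy, while a field with a different dimension or encoding ends up in 
its own group. One consequence of grouping this coarsely is that unrelated fields 
which merely happen to share encoding and dimension are placed together, which is 
the locality trade-off mentioned above.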


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

