Similar document search (MORELIKETHIS variant) using dense vectors

Jan Rygl Wed, 24 Feb 2016 01:38:53 -0800

Hello,

I would like to ask whether anybody has tried or planned to implement
indexing for dense vectors. The default scoring process is suitable only
for text documents, but we would like to use, support, or develop a plugin
that combines the default index with a dense vector index, or replaces it
with one, for non-textual documents.


We have documents represented by both texts and float vectors.
We would like to be able to search for documents similar to a given
document using its document vector (rather than queries like MORELIKETHIS).
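
To make the goal concrete, here is a minimal sketch of the search we have
in mind, done by brute force outside of Lucene. The function names, the
sample vectors, and the choice of cosine similarity are assumptions for
illustration only; a real index would avoid scanning every document.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length float vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(doc_id, vectors, k=2):
    """Return the ids of the k documents whose vectors are closest to doc_id's vector."""
    query = vectors[doc_id]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != doc_id]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]

# Hypothetical document vectors, keyed by document id.
vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.5, 0.5, 0.0],
    "d": [0.0, 0.0, 1.0],
}
print(most_similar("a", vectors, k=2))
```

This exact scan costs one similarity computation per stored document,
which is why an approximate vector index is attractive for large
collections.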

There is a technique for encoding vectors as text, but it is not very
accurate:
 * float values 0.0, 0.1, 0.8 at one vector position lie at different
distances, |0.0 - 0.1| < |0.1 - 0.8|, but the encoded strings do not
preserve this: 'V1-0.00-0.05' ~ 'V1-0.05-0.10' ~ 'V1-0.80-0.85'
(any two distinct tokens are equally dissimilar).
Therefore we would like to search the whole dense vector in the Lucene
index (using some existing vector index technique, e.g.
https://github.com/spotify/annoy).
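
The loss of distance information can be sketched as follows. The bin
width of 0.05, the half-open bin convention, and the helper names are
assumptions, so the token produced for 0.1 may differ from the one quoted
above; the point is only that term matching sees "equal" or "not equal".

```python
import math

def encode_bin(position, value, width=0.05):
    """Encode one vector component as a bin token such as 'V1-0.80-0.85'.

    Assumes half-open bins [low, low + width); the small epsilon guards
    against floating-point error at bin boundaries.
    """
    low = math.floor(value / width + 1e-9) * width
    return f"V{position}-{low:.2f}-{low + width:.2f}"

def token_match(a, b):
    """Term matching as a text index sees it: 1 if the tokens are identical, else 0."""
    return 1 if a == b else 0

values = [0.0, 0.1, 0.8]
tokens = [encode_bin(1, v) for v in values]

# True distances differ by a factor of seven ...
print(abs(values[0] - values[1]), abs(values[1] - values[2]))
# ... but both pairs of tokens simply fail to match.
print(token_match(tokens[0], tokens[1]), token_match(tokens[1], tokens[2]))
```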

My question is whether anybody has tested this functionality before and
what your opinion is about implementing it. Is it technically possible to
make a plugin supporting this functionality (with a separate distributed
index and a separate scoring function), or is it better to store the index
for dense vectors outside of Lucene?

Thank you for your insight and time,
Jimmy
