[
https://issues.apache.org/jira/browse/SOLR-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078614#comment-18078614
]
Sanjay Dutt commented on SOLR-18207:
------------------------------------
The proposed Lucene data-blind quantization feature
(https://github.com/apache/lucene/issues/16029) would allow codecs to drop raw
float vectors internally, saving storage at the cost of no longer being able
to re-quantize from the original input. Although this feature is not available
yet, I am trying to evaluate whether it is still valuable to add a Solr-side
option to disable storing raw vectors as stored fields.
Today, such a Solr option can reduce duplicate storage because Lucene may still
preserve raw vectors internally. With data-blind quantization, however, that
assumption may no longer hold: if Lucene drops raw vectors internally and Solr
also disables stored-field storage, raw vectors could no longer be retrieved
from Solr at all.
So the Solr feature may still be useful, but it probably needs to clearly
define whether raw vectors are preserved somewhere, or whether retrieval is
intentionally unsupported.
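
To make the cases concrete, here is a minimal sketch of the combinations I have
in mind. The class, enum, and flag names are hypothetical and exist only for
illustration; nothing here is an existing Solr or Lucene API.

```java
import java.util.Optional;

public class VectorRetrievalMatrix {

    /** Where a raw vector could be read back from, if anywhere. */
    enum Source { STORED_FIELD, LUCENE_VECTOR_DATA }

    /**
     * Both parameters are hypothetical flags, not real configuration names.
     *
     * @param solrStoresField Solr writes the vector as a stored field
     * @param luceneKeepsRaw  the codec keeps the raw float vectors internally
     */
    static Optional<Source> retrievalSource(boolean solrStoresField, boolean luceneKeepsRaw) {
        if (solrStoresField) {
            return Optional.of(Source.STORED_FIELD);
        }
        if (luceneKeepsRaw) {
            return Optional.of(Source.LUCENE_VECTOR_DATA);
        }
        // Neither side kept the raw vector: retrieval is not possible.
        return Optional.empty();
    }

    /** True when the same raw vector is held in both places. */
    static boolean hasDuplicateCopy(boolean solrStoresField, boolean luceneKeepsRaw) {
        return solrStoresField && luceneKeepsRaw;
    }

    public static void main(String[] args) {
        boolean[] flags = {true, false};
        for (boolean stored : flags) {
            for (boolean raw : flags) {
                String source = retrievalSource(stored, raw)
                        .map(Enum::name)
                        .orElse("UNAVAILABLE");
                System.out.printf("solrStoresField=%b luceneKeepsRaw=%b -> source=%s duplicate=%b%n",
                        stored, raw, source, hasDuplicateCopy(stored, raw));
            }
        }
    }
}
```

The last combination (both flags false) is the one that needs an explicit
decision: either it is rejected, or it is documented as retrieval being
intentionally unsupported.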
> Add derived stored retrieval for DenseVectorField to avoid duplicate vector
> storage
> -----------------------------------------------------------------------------------
>
> Key: SOLR-18207
> URL: https://issues.apache.org/jira/browse/SOLR-18207
> Project: Solr
> Issue Type: Task
> Components: vector-search
> Reporter: Sanjay Dutt
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Solr DenseVectorField currently stores vector data twice when stored="true":
> once in Lucene’s vector index for kNN/search and again in stored fields for
> retrieval. This increases index size significantly for large vector workloads.
> This change adds an opt-in mode for DenseVectorField that preserves
> stored-field semantics for normal document retrieval while avoiding the
> redundant stored-field copy of the vector payload. Instead, Solr reconstructs
> the returned vector value from Lucene vector data at fetch time.
> Key points:
> * Adds an opt-in field type/property for derived vector retrieval.
> * Avoids writing redundant stored vector bytes at index time.
> * Extends document fetch to materialize vector values from Lucene vector
> readers.
> * Keeps existing behavior unchanged unless the new option is enabled.
> * Documents the fetch-time tradeoff and recommends caution for hot paths
> that return vectors frequently, especially fl=*.
> {code:xml}
> <fieldType name="knn_vector_derived"
>            class="solr.DenseVectorField"
>            vectorDimension="1024"
>            similarityFunction="cosine"
>            knnAlgorithm="hnsw"
>            indexed="true"
>            useVectorValuesAsStored="true"/>
> {code}
> Initial scope:
> * Single-valued vector fields only.
> * Multivalued derived vector retrieval is not supported in this change.
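
For a rough sense of scale, the duplicate-storage cost described in the issue
can be estimated with simple arithmetic. This is only a back-of-the-envelope
sketch: it counts the raw float32 payload alone and ignores stored-field
compression, the HNSW graph, and any quantized copies. The document count is
an assumed example value, and the class and method names are hypothetical.

```java
public class VectorStorageEstimate {

    /** Raw payload of one float32 vector: 4 bytes per dimension. */
    static long rawVectorBytes(int dimension) {
        return (long) dimension * Float.BYTES;
    }

    /** Raw payload for numDocs vectors held in the given number of copies. */
    static long totalBytes(long numDocs, int dimension, int copies) {
        return numDocs * rawVectorBytes(dimension) * copies;
    }

    public static void main(String[] args) {
        int dimension = 1024;     // matches vectorDimension in the example field type
        long numDocs = 1_000_000; // assumed corpus size, for illustration only

        long twoCopies = totalBytes(numDocs, dimension, 2); // vector index + stored field
        long oneCopy = totalBytes(numDocs, dimension, 1);   // vector index only

        System.out.println("bytes per vector:  " + rawVectorBytes(dimension));
        System.out.println("two copies, bytes: " + twoCopies);
        System.out.println("one copy, bytes:   " + oneCopy);
        System.out.println("saved, bytes:      " + (twoCopies - oneCopy));
    }
}
```

Under these assumptions each 1024-dimension vector carries 4096 bytes of raw
payload per copy, so dropping the stored-field copy roughly halves the raw
vector payload.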