[
https://issues.apache.org/jira/browse/SOLR-17679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955517#comment-17955517
]
Khaled Alkhouli commented on SOLR-17679:
----------------------------------------
[~renato] Yes, a decent part of the documentation is done, though some sections
are still pending.
> Request for Documentation/Feature Improvement on Hybrid Lexical and Vector
> Search with Score Breakdown and Cutoff Logic
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-17679
> URL: https://issues.apache.org/jira/browse/SOLR-17679
> Project: Solr
> Issue Type: Improvement
> Components: search
> Affects Versions: 9.6.1
> Reporter: Khaled Alkhouli
> Priority: Minor
> Labels: hybrid-search, search, solr, vector-based-search
> Attachments: Screenshot from 2025-02-20 16-31-48.png
>
>
> Hello Apache Solr team,
> I was able to implement a hybrid search engine that combines *lexical search
> (edismax)* and *vector search (KNN-based embeddings)* within a single
> request. The idea is simple:
> * *Lexical Search* retrieves results based on text relevance.
> * *Vector Search* retrieves results based on semantic similarity.
> * *Hybrid Scoring* sums both scores, where a missing score (if a document
> appears in only one search) should be treated as zero.
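
For reference, the scoring rule described in the bullets above can be sketched in plain Python, outside Solr. This is only an illustration of "sum both scores, treat a missing score as zero", not Solr's internal implementation; the function name `combine_scores` and the dict-of-scores input shape are assumptions made for the example.

```python
def combine_scores(lexical, vector):
    """Sum lexical and vector scores per document id.

    A document that appears in only one of the two result sets gets 0.0
    for the score it is missing.
    """
    doc_ids = set(lexical) | set(vector)
    combined = {
        doc_id: lexical.get(doc_id, 0.0) + vector.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest combined score first; ties are broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


ranked = combine_scores({"a": 0.5, "b": 0.25}, {"b": 0.5, "c": 0.125})
print(ranked)
```

Document "b" appears in both result sets, so it is the only one whose score is a true sum; "a" and "c" keep the single score they have.
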
> This approach is working, but *there is a critical lack of documentation* on
> how to properly return individual score components of lexical search (score1)
> vs. vector search (score2 from cosine similarity). Right now, Solr only
> returns the final combined score, but there is no clear way to see {*}how
> much of that score comes from lexical search vs. vector search{*}. This is
> essential for debugging and for fine-tuning ranking strategies.
>
> I have implemented the following logic using Python:
> {code:python}
> import numpy as np
> import requests
>
> # embed(), SOLR_URL and HEADERS are assumed to be defined elsewhere.
> def hybrid_search(query, top_k=10):
>     embedding = np.array(embed([query]), dtype=np.float32)
>     # tolist() yields plain Python floats, so the vector formats as [0.1, 0.2, ...]
>     embedding = embedding[0].tolist()
>     lxq = rf"""{{!type=edismax
>         qf='text'
>         q.op=OR
>         tie=0.1
>         bq=''
>         bf=''
>         boost=''
>     }}({query})"""
>     solr_query = {"params": {
>         "q": "{!bool filter=$retrievalStage must=$rankingStage}",
>         "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
>         "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",  # Union
>         "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
>         "lexicalQuery": lxq,
>         "vectorQuery": f"{{!knn f=all_v512 topK={top_k}}}{embedding}",
>         "fl": "text",
>         "rows": top_k,
>         "fq": [""],
>         "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}",
>         # NOTE: $cutoff is referenced here but not defined anywhere in this request.
>         "rqq": "{!frange l=$cutoff}query($rankingStage)",
>         "sort": "score desc",
>     }}
>     response = requests.post(SOLR_URL, headers=HEADERS, json=solr_query)
>     return response.json()
> {code}
> h3. *Issues & Missing Documentation*
> # *No Way to Retrieve Individual Scores in a Hybrid Search*
> There is no clear documentation on how to return:
> ** The *lexical search score* separately.
> ** The *vector search score* separately.
> ** The *final combined score* (which Solr already provides).
> Right now, we’re left guessing whether the sum of these scores works as
> expected, making debugging and tuning unnecessarily difficult.
> # *No Clear Way to Implement Cutoff Logic in Solr*
> In a hybrid search, I need to filter out results that don’t meet a {*}minimum
> score threshold{*}. Right now, I have to implement this in Python, {*}which
> defeats the purpose of using Solr for ranking in the first place{*}.
> ** How can we enforce a {*}score-based cutoff directly in Solr{*}, without
> external filtering?
> ** The \{!frange} query parser is mentioned in the documentation but lacks
> {*}clear examples on how to apply it to hybrid search{*}.
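
On the cutoff question above, one approach that may work is to pass `{!frange}` as a filter query over the same function that drives ranking, so documents below the threshold are dropped inside Solr. The sketch below only builds the request parameters and does not contact Solr. `with_score_cutoff` is a hypothetical helper, it assumes `fq` is a list and that the request already defines `rankingStage` as in the code above, and how this filter interacts with `scale()` and `knn` on a given Solr version has not been verified here.

```python
def with_score_cutoff(params, cutoff, ranking_param="rankingStage"):
    """Return a copy of params with an frange filter on the ranking function.

    frange's lower bound `l` is inclusive by default, so documents whose
    ranking value is >= cutoff are kept. Hypothetical helper for illustration.
    """
    updated = dict(params)
    # Drop empty filter strings such as the [""] placeholder.
    filters = [fq for fq in updated.get("fq", []) if fq]
    filters.append(f"{{!frange l={cutoff}}}query(${ranking_param})")
    updated["fq"] = filters
    return updated


params = with_score_cutoff({"q": "*:*", "fq": [""]}, cutoff=0.5)
print(params["fq"])
```

Building the filter as a separate `fq` keeps the cutoff independent of the `rq` rerank stage, which may make it easier to test the threshold on its own.
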
> h3. *Feature Request / Documentation Improvement*
> * *Provide a way to return individual scores for lexical and vector search
> in the response.* This should be as simple as adding fields like
> {{{}fl=score,lexical_score,vector_score{}}}.
> * *Clarify how to apply cutoff logic in a hybrid search.* This is an
> essential ranking mechanism, and yet, there’s little guidance on how to do
> this efficiently within Solr itself.
> Looking forward to a response.
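
On the request for per-component scores: Solr's `fl` parameter accepts function-query pseudo-fields with aliases, so one option that may work is to alias `query($lexicalQuery)` and `query($vectorQuery)` next to `score`. The sketch below only builds the `fl` string. The helper name and the aliases `lexical_score` and `vector_score` are illustrative, the second argument to `query()` is meant as the default value for documents the subquery does not match, and none of this has been verified against the 9.6.1 setup described in this issue.

```python
def score_breakdown_fl(base_fields, components):
    """Build an fl value that returns score plus one aliased pseudo-field
    per component query.

    `components` maps an output alias to the name of the request parameter
    holding that component's query. Hypothetical helper for illustration.
    """
    parts = list(base_fields) + ["score"]
    for alias, param in components.items():
        # query($param,0): value of the sub-query, or 0 when it does not match.
        parts.append(f"{alias}:query(${param},0)")
    return ",".join(parts)


fl = score_breakdown_fl(
    ["text"],
    {"lexical_score": "lexicalQuery", "vector_score": "vectorQuery"},
)
print(fl)
```

The resulting string could replace the `"fl": "text"` entry in the request above, leaving the rest of the parameters unchanged.
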