Khaled Alkhouli created SOLR-17679:
--------------------------------------
Summary: Request for Documentation on Hybrid Lexical and Vector
Search with Score Breakdown and Cutoff Logic
Key: SOLR-17679
URL: https://issues.apache.org/jira/browse/SOLR-17679
Project: Solr
Issue Type: Task
Components: search
Affects Versions: 9.6.1
Reporter: Khaled Alkhouli
Hello Apache Solr team,
I am building a hybrid search engine that combines lexical search (traditional
keyword-based search) and vector search (semantic search using embeddings) in a
single request. The request should do the following:
# *Lexical Search:* Using edismax with specified fields and weights.
# *Vector Search:* Using K-Nearest Neighbors (KNN) based on embeddings.
# *Hybrid Score Combination:* The final score is the sum of the normalized
lexical score and the vector search score. If a document appears in only one
search, the other score should be treated as zero.
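To make the intended combination concrete, here is a minimal sketch in plain Python, independent of Solr. The function names and the sample scores are illustrative assumptions, as is the handling of the case where all lexical scores are equal. It is meant only to show the arithmetic: min-max normalise the lexical scores, then add the vector score, treating a missing score as zero.

```python
from typing import Dict


def min_max_scale(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw scores linearly onto [0, 1] (hypothetical helper)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption for this sketch: identical scores all map to 0.0.
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def combine_scores(lexical: Dict[str, float],
                   vector: Dict[str, float]) -> Dict[str, float]:
    """Sum the normalised lexical score and the vector score per document.

    A document missing from one result set contributes 0.0 for that part.
    """
    normalised = min_max_scale(lexical)
    doc_ids = set(normalised) | set(vector)  # union of both result sets
    return {
        doc_id: normalised.get(doc_id, 0.0) + vector.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


if __name__ == "__main__":
    # Made-up scores: "b" is in both result sets, "d" only in the vector set.
    lexical_hits = {"a": 2.0, "b": 6.0, "c": 4.0}
    vector_hits = {"b": 0.5, "d": 0.9}
    for doc_id, score in sorted(combine_scores(lexical_hits, vector_hits).items()):
        print(doc_id, score)
```

In this sketch the normalisation is computed over the lexical result set only, so the lowest-scoring lexical hit always ends up with a lexical contribution of zero.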
I have implemented the following logic using Python:
{code:python}
import numpy as np
import requests

# embed, query_terms, cutoff_ratio, SOLR_URL and HEADERS are defined elsewhere.
def hybrid_search(query, top_k=10):
    # Embed the query and keep the first (only) vector as a list.
    embedding = np.array(embed([query]), dtype=np.float32)
    embedding = list(embedding[0])
    lxq = rf"""{{!type=edismax qf='all_txt'
        q.op=OR tie=0.1
    }}({query_terms})"""
    solr_query = {
        "params": {
            "q": "{!bool filter=$retrievalStage must=$rankingStage}",
            "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
            "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",  # Union
            "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
            "lexicalQuery": lxq,
            "vectorQuery": f"{{!knn f=all_v512 topK={top_k}}}{embedding}",
            "fl": "post_id,all_txt,score",
            "rows": top_k,
            "fq": [""],
            "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}",
            "rqq": "{!frange l=$cutoff}query($rankingStage)",
            "sort": "score desc",
            "cutoff": f"{cutoff_ratio}",
        }
    }
    response = requests.post(SOLR_URL, headers=HEADERS, json=solr_query)
    response = response.json()
    return response
{code}
The response returns documents with a combined score, which I assume is the
sum of:
* *Lexical Search Score:* Normalized between 0 and 1.
* *Vector Search Score:* Already bounded between 0 and 1.
If a document is present in one search but not the other, the score from the
missing part is added as zero.
h3. *Request:*
I would like documentation or guidance on the following:
# *View and Return Individual Scores:*
How can I retrieve the following scores in the same request?
** Lexical search score
** Vector search score
** Final combined score (already retrieved)
I would like to display all three scores in the response together for each
document.
# *Cutoff Logic:*
I am using a Python function to calculate a cutoff threshold based on the
scores. Is it possible to implement this cutoff directly in Solr so that only
documents that pass a certain threshold are returned? If so, how can I achieve
this within Solr’s query syntax, without relying on external Python logic?
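As an illustration of the kind of client-side cutoff meant here, the following is a hypothetical sketch, not the actual function in use. It assumes the threshold is a fixed fraction (`cutoff_ratio`) of the best combined score. The function name and the sample values are assumptions made for this example.

```python
from typing import Dict


def apply_cutoff(scores: Dict[str, float], cutoff_ratio: float) -> Dict[str, float]:
    """Keep only documents scoring at least cutoff_ratio * best score.

    Hypothetical rule for illustration; other threshold rules are possible.
    """
    if not scores:
        return {}
    threshold = cutoff_ratio * max(scores.values())
    return {doc_id: s for doc_id, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    # Made-up combined scores keyed by document id.
    combined = {"a": 1.5, "b": 0.9, "c": 0.5, "d": 0.0}
    print(apply_cutoff(combined, cutoff_ratio=0.5))
```

Because this threshold depends on the best score in the result set, it cannot be fixed before the query runs. That is why I currently compute it outside Solr.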
I appreciate any help or documentation that can assist with:
* Returning separate scores for lexical and vector queries.
* Implementing cutoff logic natively in Solr.
Thank you!