[ 
https://issues.apache.org/jira/browse/HIVE-27743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-27743:
----------------------------------
    Labels: hive-ai  (was: )

> Semantic Search In Hive
> -----------------------
>
>                 Key: HIVE-27743
>                 URL: https://issues.apache.org/jira/browse/HIVE-27743
>             Project: Hive
>          Issue Type: Wish
>         Environment: *  
>            Reporter: Sreenath
>            Assignee: Sreenath
>            Priority: Major
>              Labels: hive-ai
>
> _Semantic search is the technology that powers *vector databases*, and we can 
> have the same power in Hive._
> Semantic search is a way for computers to understand the meaning behind words 
> and phrases when you're searching for something. Instead of just looking for 
> exact matches of keywords, it tries to figure out what you're really asking 
> and provides results that are more relevant and meaningful to your question. 
> It's like having a search engine that can understand what you mean, not just 
> what you say, making it easier to find the information you're looking for. 
> This ticket proposes adding semantic search to Hive as UDFs.
> The proposal is to implement functions that compute the similarity (or 
> distance) between two values on the fly. Once we have them, semantic search 
> becomes a simple predicate in a WHERE clause.
>  * E.g. (using a cosine similarity function): “WHERE cos_sim(region, 'europe') > 0.9” 
> could return records with regions like Scandinavia, Nordic, Baltic, etc.
>  * We could have functions that accept values as text or as vector 
> embeddings.
> *On the implementation side, we can have a set of new UDFs and configuration 
> properties:*
> *UDFs:*
>  # *embed(sentences[, prompt, embedding_type, normalize_embeddings])*
>  # *cos_sim(a, b)*
>  # *dot_score(a, b)*
>  # *euclidean_sim(a, b)*
>  # *manhattan_sim(a, b)*
> Additionally, we can have an *llm(text)* function to use the power of an LLM.
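
For illustration, here is a minimal Python sketch of what the four scoring UDFs listed above might compute when given plain numeric vectors. The ticket does not define their semantics, so this assumes that euclidean_sim and manhattan_sim return the negated distance (larger means more similar). The region vectors are hand-made toy values, not output of any embedding model, and the final list comprehension only mimics the "cos_sim(...) > 0.9" predicate outside of Hive.

```python
import math


def dot_score(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a, b):
    """Cosine similarity in [-1, 1]; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(dot_score(a, a))
    norm_b = math.sqrt(dot_score(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot_score(a, b) / (norm_a * norm_b)


def euclidean_sim(a, b):
    """Assumption: negated Euclidean distance, so larger means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_sim(a, b):
    """Assumption: negated Manhattan (L1) distance, so larger means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return -sum(abs(x - y) for x, y in zip(a, b))


# Toy 3-dimensional "embeddings" (illustrative values only).
rows = [
    ("Scandinavia", [0.9, 0.1, 0.0]),
    ("Baltic", [0.8, 0.3, 0.1]),
    ("Sahara", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]  # stands in for the embedding of 'europe'

# Equivalent of: WHERE cos_sim(region, 'europe') > 0.9
matches = [name for name, vec in rows if cos_sim(vec, query) > 0.9]
print(matches)
```

In a real implementation the vectors would come from embed(), and the threshold would need tuning per model, since cosine scores are not comparable across embedding models.
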
> *Configuration properties:*
>  # hive.embedding.model - Path to a pre-trained SentenceTransformer model
>  # hive.embedding.batch_size - The batch size used for the computation
>  # hive.embedding.precision - The precision to use for the embeddings. Can be 
> “float32”, “int8”, “uint8”, “binary”, or “ubinary”
>  # hive.embedding.default_prompt - Prompt prefix that must be used by default
>  # hive.embedding.cache_folder - Path to a local folder to store models
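
As a rough sketch of how the proposed properties could be read and validated, the Python below parses a flat key/value map into a typed configuration. The property names and the allowed precision values come from the description above. The EmbeddingConfig class, the parse_embedding_config helper, the defaults (batch size 32, precision "float32") and the example model path are all assumptions made for illustration; the ticket does not specify them.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Allowed values as listed for hive.embedding.precision in the proposal.
ALLOWED_PRECISIONS = ("float32", "int8", "uint8", "binary", "ubinary")
PREFIX = "hive.embedding."


@dataclass(frozen=True)
class EmbeddingConfig:
    """Hypothetical typed view of the proposed hive.embedding.* properties."""
    model: str
    batch_size: int = 32                   # assumed default
    precision: str = "float32"             # assumed default
    default_prompt: Optional[str] = None
    cache_folder: Optional[str] = None


def parse_embedding_config(props: Mapping[str, str]) -> EmbeddingConfig:
    """Build an EmbeddingConfig from string properties, rejecting bad values."""
    model = props.get(PREFIX + "model")
    if not model:
        raise ValueError("hive.embedding.model must be set")

    batch_size = int(props.get(PREFIX + "batch_size", "32"))
    if batch_size <= 0:
        raise ValueError("hive.embedding.batch_size must be positive")

    precision = props.get(PREFIX + "precision", "float32")
    if precision not in ALLOWED_PRECISIONS:
        raise ValueError(f"unsupported hive.embedding.precision: {precision!r}")

    return EmbeddingConfig(
        model=model,
        batch_size=batch_size,
        precision=precision,
        default_prompt=props.get(PREFIX + "default_prompt"),
        cache_folder=props.get(PREFIX + "cache_folder"),
    )


# Example with a made-up model path.
cfg = parse_embedding_config({
    "hive.embedding.model": "/models/example-sentence-model",
    "hive.embedding.precision": "int8",
})
print(cfg)
```

Validating the precision up front means a typo in the configuration fails at parse time rather than in the middle of a query.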



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
