GitHub user ep9io edited a comment on the discussion: Further LLM Support

I can share the approach I was taking with LangChain4J's embeddings and search, and see if it makes sense or helps point to a direction.

The overall idea is to split the tasks into independent plugins so that they can be reused more broadly and can leverage other plugins within HOP's ecosystem:

- Create embeddings
- Persist embeddings
- Retrieve embeddings
- Compare/search embeddings

## Plugin 1: Create the embeddings.  In other words, _transform_ text into 
number arrays.

Inputs: text field and model details
Outputs: array of numbers

This plugin selects an input field as the input value (e.g. a text value) and passes it through the model (typically an encoder like BERT). What you get back is a dense vector (an embedding). These are also called feature vectors, word vectors, or points in a latent space. They can look like this:

`[0.12, -0.34, 0.56, 0.78, -0.91]`, which could represent a word in a 5-dimensional embedding space.
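For illustration, a minimal sketch of this step against LangChain4J's `EmbeddingModel` interface (the helper name and the idea of injecting the concrete model from the transform's "model details" are assumptions, not your plugin's actual code):

```
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;

public class EmbedStep {

  // The transform's "model details" decide which concrete EmbeddingModel
  // gets injected here (an in-process ONNX model, a remote API, etc.).
  static float[] embed(EmbeddingModel model, String textValue) {
    TextSegment segment = TextSegment.from(textValue);
    Embedding embedding = model.embed(segment).content();
    return embedding.vector(); // the "array of numbers" emitted as the output field
  }
}
```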

Your plugin already does this in the line highlighted below:
```
private void addToVector(String textValue, Object keyValue, Object[] lookupRowValues)
...
      TextSegment segment = TextSegment.from(textValue);
  >>> Embedding embedding = data.embeddingModel.embed(segment).content();  <<<
      String keyString = String.valueOf(keyValue);
      data.embeddingStore.add(keyString, embedding);
...
  }
```

These numbers on their own are quite useful, and the output of this plugin is just these arrays, nothing else. Think of it as a pre-processing step.

Attaching a Parquet/CSV output transform after this plugin has many applications. This plugin alone can be used to create a dataset that can be used or reused further down the workflow. In other words, once you create the embeddings, you don't need to recreate them every time you run a task; the pipelines further down the workflow could simply read a Parquet file that contains these embeddings.
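As a rough sketch of what such an output row could carry (the field layout and the comma-separated encoding of the vector are assumptions for illustration only; a Parquet output could just as well use a native array column):

```
import java.util.StringJoiner;

public class EmbeddingRow {

  // e.g. "0.12,-0.34,0.56,0.78,-0.91"
  static String encodeVector(float[] vector) {
    StringJoiner joiner = new StringJoiner(",");
    for (float v : vector) {
      joiner.add(Float.toString(v));
    }
    return joiner.toString();
  }

  // Hypothetical row layout: [key, original_text, embedding]
  static Object[] toRow(String key, String text, float[] vector) {
    return new Object[] {key, text, encodeVector(vector)};
  }
}
```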

## Plugin 2 (concept): Store the embeddings
Inputs: array of numbers
Outputs: not sure yet; perhaps a save status or the IDs of the stored entries.

Your plugin implements this concept:
```
private void addToVector(String textValue, Object keyValue, Object[] lookupRowValues)
...
      TextSegment segment = TextSegment.from(textValue);
      Embedding embedding = data.embeddingModel.embed(segment).content();
      String keyString = String.valueOf(keyValue);
  >>> data.embeddingStore.add(keyString, embedding);  <<<
...
  }
```

Saving to a text file or Parquet is one option that's already available out of the box in HOP, and it's quite useful. However, those are mostly handy for backups, troubleshooting, or further analysis. A more practical approach is to store the vectors in a database that's designed for them, as your plugin already does. From what I can see, your plugin uses in-memory and Neo4j; the plugin I was working on was going to save to Milvus.

This plugin would be similar to existing HOP output transforms in that it receives inputs (vectors) and persists them somewhere designed for vectors. The in-memory option probably won't make much sense here, but it does in plugin 4.
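A minimal sketch of that concept, reusing the same `EmbeddingStore.add(id, embedding)` call your plugin already makes (the in-memory store here is only a stand-in; a Milvus- or Neo4j-backed store from the corresponding LangChain4J module would be built from the transform's connection settings instead):

```
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

public class StoreStep {
  public static void main(String[] args) {
    // Stand-in store; a vector database would be configured here instead.
    EmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();

    // Incoming row: a key plus the "array of numbers" produced by plugin 1.
    String key = "row-42";
    Embedding embedding = Embedding.from(new float[] {0.12f, -0.34f, 0.56f, 0.78f, -0.91f});

    // Same call the plugin already makes: persist the vector under its key.
    store.add(key, embedding);
  }
}
```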

## Plugin 3 (general concept): The inverse of plugin 2, the input transform.  
Inputs: file location or address, depending on the database type.
Outputs: array of numbers.

Similar to HOP's Text File, JSON, Parquet, and database input transforms, it reads from a source and outputs the values from that source. I haven't used the Neo4j option, so I'm not sure whether that can be reused, but for other sources such as Milvus (or whatever vector database), something would need to be created. Anything that isn't already available out of the box in HOP (e.g. Parquet, text, etc.) would fall under this plugin.
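For the file-based sources, a rough sketch of the reverse direction (the comma-separated encoding mirrors the plugin 1 sketch above and is purely an assumption; a vector database would instead return vectors through its own client API):

```
import dev.langchain4j.data.embedding.Embedding;

public class ReadEmbeddingStep {

  // Parse a stored text field such as "0.12,-0.34,0.56,0.78,-0.91"
  // back into the float array / Embedding that plugin 4 can search with.
  static Embedding decodeVector(String field) {
    String[] parts = field.split(",");
    float[] vector = new float[parts.length];
    for (int i = 0; i < parts.length; i++) {
      vector[i] = Float.parseFloat(parts[i].trim());
    }
    return Embedding.from(vector);
  }

  public static void main(String[] args) {
    Embedding embedding = decodeVector("0.12,-0.34,0.56,0.78,-0.91");
    System.out.println("dimensions=" + embedding.vector().length);
  }
}
```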

## Plugin 4: Search/Similarity/etc.

Inputs: embedding IDs, arrays of numbers, model details, and perhaps search parameters.
Outputs: generated records containing the relevant matches along with their scores.

This is pretty much what your following code does:

```
private List<Object[]> getFromVector(Object[] keyRow) throws HopValueException {
...
    String lookupValueString = getInputRowMeta().getString(keyRow, data.indexOfMainField);

    Embedding queryEmbedding = data.embeddingModel.embed(lookupValueString).content();
    EmbeddingSearchRequest esr = new EmbeddingSearchRequest(queryEmbedding, data.maxResults, null, null);
    EmbeddingSearchResult<TextSegment> relevant = data.embeddingStore.search(esr);

    List<Object[]> retList = new ArrayList<Object[]>(data.maxResults);

    for (EmbeddingMatch<TextSegment> embeddingMatch : relevant.matches()) {
      String key = embeddingMatch.embeddingId();
      Double score = embeddingMatch.score();

      Object[] matchedRow = data.look.get(key);
      Object resultKey = matchedRow[SemanticSearchData.KEY_INDEX];
      String value = (String) matchedRow[SemanticSearchData.VALUE_INDEX];
      ...
  }
```

As for cosine similarity and t-SNE, as @usbrandon mentioned, I would leave those out. Linear methods such as PCA and non-linear ones like t-SNE are other forms of (pre/post-)processing (e.g. analysis, dimensionality reduction) and are perhaps best left to another HOP plugin or an external process such as Python. For example, they can be run against the embeddings dataset produced by plugin 1. Cosine similarity is one of the most commonly used similarity functions; it measures the direction (angle) between two vectors. However, there are other similarity functions (Euclidean distance, dot product, etc.), and I would leave that choice to the vector database or external tool. They'll do a better job than something implemented in HOP; for example, they might use the GPU for better performance (e.g. Meta's Faiss). That said, plugin 4 could allow extra parameters to be passed through to the database/tool (e.g. to use a different function).
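Just to make the terminology concrete, this is roughly what the database or external tool computes for cosine similarity (plain Java, no library dependencies):

```
public class Similarity {

  // Cosine similarity: dot(a, b) / (|a| * |b|), i.e. the cosine of the angle
  // between the two vectors. ~1.0 means same direction, 0.0 means orthogonal.
  static double cosine(float[] a, float[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    float[] a = {0.12f, -0.34f, 0.56f, 0.78f, -0.91f};
    float[] b = {0.10f, -0.30f, 0.60f, 0.70f, -0.95f};
    System.out.println(cosine(a, b)); // close to 1.0 for similar directions
  }
}
```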

Regarding in-memory storage, it can be tricky to scale when the data gets large, but it still serves a useful purpose. Your plugin already has it, and there are other forms as well. I was trying to fit Meta's Faiss into HOP, but abandoned that and instead used Faiss as an in-memory similarity search (using Euclidean distance) via Python. Plugin 4 could read one input stream into memory and then do the searching using the values from another input stream. I think HOP's Stream Lookup transform does something like this: it reads one stream into memory and then uses it to process the values (lookups) of another stream.
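A sketch of that stream-lookup style of flow, reusing the same LangChain4J search calls as your plugin (the in-memory store and the injected embedding model are assumptions for illustration):

```
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingSearchResult;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

import java.util.List;

public class StreamLookupStyleSearch {

  static void run(EmbeddingModel model, List<String> lookupStream, List<String> mainStream) {
    // 1. Read the "lookup" stream into memory, like HOP's Stream Lookup does.
    EmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
    for (int i = 0; i < lookupStream.size(); i++) {
      Embedding embedding = model.embed(TextSegment.from(lookupStream.get(i))).content();
      store.add(String.valueOf(i), embedding);
    }

    // 2. For each row of the main stream, search the in-memory store.
    int maxResults = 3;
    for (String query : mainStream) {
      Embedding queryEmbedding = model.embed(query).content();
      EmbeddingSearchRequest esr = new EmbeddingSearchRequest(queryEmbedding, maxResults, null, null);
      EmbeddingSearchResult<TextSegment> relevant = store.search(esr);

      for (EmbeddingMatch<TextSegment> match : relevant.matches()) {
        // Emit one output row per match: query, matched id, score.
        System.out.println(query + " -> " + match.embeddingId() + " (" + match.score() + ")");
      }
    }
  }
}
```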

GitHub link: 
https://github.com/apache/hop/discussions/4732#discussioncomment-11718595
